Abstract

Regularized kernel methods such as support vector machines and least-squares support vector regression constitute an important class of standard learning algorithms in machine learning. Theoretical investigations concerning asymptotic properties have mainly focused on rates of convergence during the last years, but there are only very few and limited (asymptotic) results on statistical inference so far. As this is a serious limitation for their use in mathematical statistics, the goal of the article is to fill this gap. Based on asymptotic normality of many of these methods, the article derives a strongly consistent estimator for the unknown covariance matrix of the limiting normal distribution. In this way, we obtain asymptotically correct confidence sets for $\psi(f_{P,\lambda_0})$ where $f_{P,\lambda_0}$ denotes the minimizer of the regularized risk in the reproducing kernel Hilbert space $H$ and $\psi : H \to \mathbb{R}^m$ is any Hadamard-differentiable functional. Applications include (multivariate) pointwise confidence sets for values of $f_{P,\lambda_0}$ and confidence sets for gradients, integrals, and norms.

Keywords: Asymptotic confidence sets, asymptotic normality, least-squares support vector regression, regularized kernel methods, support vector machines.
1 Introduction

Regularized kernel methods constitute an important class of standard learning algorithms in machine learning theory. The prominent learning algorithms support vector machine (for classification) and (least-squares) support vector regression (for regression) also belong to this class; see, e.g., [28], [17], and [21]. While these methods are standard in machine learning theory and are widely applied, their propagation in mathematical statistics is still limited. This is partly due to the fact that there is a lack of results on statistical inference for these methods so far.

In machine learning theory, the goal in a supervised learning problem is to find a "good" predictor $f : \mathcal{X} \to \mathcal{Y}$ which maps the observed value $x \in \mathcal{X}$ of an input variable $X$ to a prediction of the unobserved value $y \in \mathcal{Y}$ of an output variable $Y$. A learning algorithm $S_n$ is a mapping which maps a set $D_n$ of observed training data $(x_1,y_1),\dots,(x_n,y_n)$ to a predictor $f_{D_n}$. In mathematical statistics, such a problem would rather be called a (nonparametric) regression (or classification) problem: $X$ is the covariate, $Y$ is the response variable, $S_n$ is an estimator, and $S_n(D_n) = f_{D_n}$ is the estimated function. In both contexts, a "good" predictor/estimate $f$ has a small expected loss (also called risk) $\mathcal{R}_{L,P}(f) = \int L(x,y,f(x))\,P(d(x,y))$ where $L$ is a "suitable" loss function and $P$ is the joint distribution of $X$ and $Y$.

However, apart from the different terminology, there is also a real difference: In machine learning, the goal is to find any predictor $f$ which has a small risk and, accordingly, a learning algorithm $S_n$ should be risk-consistent, i.e., $\mathcal{R}_{L,P}(S_n(D_n))$ converges to the infimal risk $\inf_f \mathcal{R}_{L,P}(f)$ for $n \to \infty$. In statistics, it is common, e.g., to make a signal plus noise assumption such as $Y = f_0(X) + g(X)\varepsilon$, and the goal is to estimate the unknown regression function $f_0$. Under suitable assumptions, $f_0$ minimizes $\mathcal{R}_{L,P}(f)$ in certain sets of functions $f$. While, in machine learning, one is mainly interested in minimizing the risk, in statistics, one is mainly interested in the minimizer and, accordingly, an estimator $S_n$ should be consistent in the sense that $S_n(D_n)$ converges to $f_0$. For statistical inference, it is also crucial to have estimates for the error of the estimator, e.g., in order to obtain confidence sets or hypothesis tests. While consistency results for the risk, e.g. [19], [29], [20], and [5], and also for the functions, e.g. [22, §3] and [10, Cor. 3.7], are well-known for regularized kernel methods, there are only very few and limited results concerning statistical inference.

In order to fill this gap, asymptotic confidence sets for a wide class of regularized kernel methods are developed in the following. This is possible now because [10] derives asymptotic normality of these methods and, based on this result, estimating the error of the estimate becomes tractable. Let $f_{\mathbf{D}_n,\Lambda_n}$ be the (nonparametric) estimate obtained by a regularized kernel method ($\Lambda_n$ is a data-driven regularization parameter), fix any $\lambda_0 \in (0,\infty)$, and let $f_{P,\lambda_0}$ be the minimizer of the regularized problem

$$f \mapsto \mathcal{R}_{L,P}(f) + \lambda_0\|f\|_H^2 \tag{1}$$

in the function space $H$ (a so-called reproducing kernel Hilbert space). According to [10, Theorem 3.1], under some assumptions, the sequence of function-valued random variables $\sqrt{n}\,(f_{\mathbf{D}_n,\Lambda_n} - f_{P,\lambda_0})$ weakly converges to a Gaussian process in $H$. As a consequence, for differentiable functions $\psi : H \to \mathbb{R}^m$, it follows that

$$\sqrt{n}\,\bigl(\psi(f_{\mathbf{D}_n,\Lambda_n}) - \psi(f_{P,\lambda_0})\bigr) \rightsquigarrow \mathcal{N}_m(0,\Sigma_P).$$

In order to obtain asymptotic confidence sets, e.g., for the vector of values $f_{P,\lambda_0}(x_j)$, $j \in \{1,\dots,m\}$, one has to choose $\psi(f) = (f(x_1),\dots,f(x_m))$ and it only remains to estimate the asymptotic covariance matrix $\Sigma_P$. The derivation of a consistent estimator is not a trivial task and is the main issue of the article. However, pointwise confidence sets for the true values of $f_{P,\lambda_0}$ are not the only possibility to directly apply the results of the article. We also obtain confidence sets, e.g., for integrals of $f_{P,\lambda_0}$ (choose $\psi(f) = \int_B f\,d\lambda$) or for the differential of $f_{P,\lambda_0}$ in a point $x_0$ (choose $\psi(f) = \partial f(x_0)$) and many others. Essentially, it is only needed that $\psi$ takes its values in $\mathbb{R}^m$ for any $m \in \mathbb{N}$ and is suitably differentiable.

Note that we are only able to derive asymptotic confidence sets for the unknown solution $f_{P,\lambda_0}$ of the regularized problem (1). Of course, it would be desirable to obtain asymptotic confidence sets for the minimizer of the unregularized risk $\mathcal{R}_{L,P}$. However, in our completely nonparametric setting ($P$ is totally unknown), this would require a uniform rate of convergence of the learning algorithm/estimator to the minimizer of $\mathcal{R}_{L,P}$ (if such a minimizer exists at all) and it is well-known from the no-free-lunch theorem [8] that such a uniform rate of convergence does not exist. That is, similar results for the minimizer of the unregularized $\mathcal{R}_{L,P}$ can only be obtained under substantial assumptions on the unknown distribution $P$.
Accordingly, the approach in the present article, which focuses on applications in statistical inference, considerably differs from the approach common in machine learning theory, which focuses on (as fast as possible) rates of convergence of the risk, e.g., [23], [3], [2], [23], [16]. The latter approach considers large classes $\mathcal{P}$ of probability measures for which learning rates, e.g., in the form

$$P^n\Bigl(\mathcal{R}_{L,P}(f_{D_n,\lambda_n}) - \inf_f \mathcal{R}_{L,P}(f) < c_{P,\delta}\cdot n^{-\beta}\Bigr) \ge 1-\delta,$$

exist and where the rate of convergence $\beta > 0$ does not depend on $P$ and the infimum is taken over all measurable functions $f : \mathcal{X} \to \mathbb{R}$. Such learning rates are an important tool in order to compare theoretical properties of different learning algorithms. However, these results cannot be applied offhand for statistical inference in real applications because the constant $c_{P,\delta}$ is usually unknown. Furthermore, the focus lies on maximizing $\beta$ which, typically, results in an increase of the constant $c_{P,\delta}$ so that the bound $c_{P,\delta}\cdot n^{-\beta}$ might be large for ordinary sample sizes $n$. In addition, whether a probability measure belongs to $\mathcal{P}$ is often subject to assumptions which are hard to communicate to practitioners and hard to check satisfactorily or make plausible in applications. A common assumption is, e.g., Tsybakov's noise assumption [25, p. 138].

The present article derives asymptotic confidence sets for $\psi(f_{P,\lambda_0})$ based on the asymptotic normality results of [10]. So far, there are only very few publications which are concerned with statistical inference for regularized kernel methods. In the special case of classification by use of the hinge loss and linear SVMs (i.e. linear kernel), asymptotic normality of the coefficients of the linear SVM is shown in [15] under a number of regularity assumptions (e.g. existence of continuous densities). Though this could yield an alternative way of deriving asymptotic confidence sets in this special case, this has not been done so far. In the special case of classification by use of the hinge loss and SVMs with finite-dimensional kernels (i.e. a parametric setting), [13] shows asymptotic normality of the prediction error estimators and derives confidence intervals for the prediction error of the empirical SVM. In the special case of regression by use of least-squares support vector regression, [6] proposes approximate confidence intervals for the regression function whose derivation is partly based on heuristics; it is not documented whether these intervals approximately hold the intended confidence level in simulated examples.

In the following Section 2, some basics of regularized kernel methods are recalled. The main part of the article, Section 3, consists of two subsections: Subsection 3.1 derives a consistent estimator of $\Sigma_P$ and asymptotic confidence intervals; Subsection 3.2 shows how the calculation of the estimator can be done in a computationally tractable way. All proofs are given in the appendix.

2 Regularized Kernel Methods 

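Let $(\Omega, \mathcal{A}, Q)$ be a probability space, let $\mathcal{X}$ be a closed and bounded subset of $\mathbb{R}^d$, and let $\mathcal{Y}$ be a closed subset of $\mathbb{R}$ with Borel-$\sigma$-algebra $\mathfrak{B}(\mathcal{Y})$. The Borel-$\sigma$-algebra of $\mathcal{X}\times\mathcal{Y}$ is denoted by $\mathfrak{B}(\mathcal{X}\times\mathcal{Y})$. Let

$$X_1,\dots,X_n : (\Omega,\mathcal{A},Q) \to (\mathcal{X},\mathfrak{B}(\mathcal{X})), \qquad Y_1,\dots,Y_n : (\Omega,\mathcal{A},Q) \to (\mathcal{Y},\mathfrak{B}(\mathcal{Y}))$$

be random variables such that $(X_1,Y_1),\dots,(X_n,Y_n)$ are independent and identically distributed according to some unknown probability measure $P$ on $(\mathcal{X}\times\mathcal{Y},\mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$. Define

$$\mathbf{D}_n := \bigl((X_1,Y_1),\dots,(X_n,Y_n)\bigr) \qquad \forall\, n \in \mathbb{N}.$$

A measurable map $L : \mathcal{X}\times\mathcal{Y}\times\mathbb{R} \to [0,\infty)$ is called loss function. A loss function $L$ is called convex loss function if it is convex in its third argument, i.e. $t \mapsto L(x,y,t)$ is convex for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$. Furthermore, a loss function $L$ is called P-integrable Nemitski loss function of order $p \in [1,\infty)$ if there is a $P$-integrable function $b : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}$ and a constant $c \in (0,\infty)$ such that

$$|L(x,y,t)| \le b(x,y) + c\,|t|^p \qquad \forall\,(x,y,t) \in \mathcal{X}\times\mathcal{Y}\times\mathbb{R}.$$

If $b$ is even $P$-square-integrable, $L$ is called P-square-integrable Nemitski loss function of order $p \in [1,\infty)$. The risk of a measurable function $f : \mathcal{X} \to \mathbb{R}$ is defined by

$$\mathcal{R}_{L,P}(f) = \int_{\mathcal{X}\times\mathcal{Y}} L(x,y,f(x))\,P(d(x,y)).$$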

The goal is to estimate a function $f : \mathcal{X} \to \mathbb{R}$ which minimizes this risk. The estimates obtained from regularized kernel methods are elements of so-called reproducing kernel Hilbert spaces (RKHS) $H$. An RKHS $H$ is a certain Hilbert space of functions $f : \mathcal{X} \to \mathbb{R}$ which is generated by a kernel $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$. See e.g. [17], [1], [21], or [12] for details about these concepts.

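Let $H$ be such an RKHS. Then, the regularized risk of an element $f \in H$ is defined to be

$$\mathcal{R}_{L,P,\lambda}(f) = \mathcal{R}_{L,P}(f) + \lambda\|f\|_H^2, \qquad \text{where } \lambda \in (0,\infty).$$

An element $f \in H$ is denoted by $f_{P,\lambda}$ if it minimizes the regularized risk in $H$. That is,

$$\mathcal{R}_{L,P}(f_{P,\lambda}) + \lambda\|f_{P,\lambda}\|_H^2 = \inf_{f \in H}\bigl(\mathcal{R}_{L,P}(f) + \lambda\|f\|_H^2\bigr).$$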



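The estimator is defined by

$$S_n : (\mathcal{X}\times\mathcal{Y})^n \times (0,\infty) \to H, \qquad (D_n,\lambda) \mapsto f_{D_n,\lambda}$$

where $f_{D_n,\lambda}$ is that function $f \in H$ which minimizes

$$\frac{1}{n}\sum_{i=1}^n L(x_i,y_i,f(x_i)) + \lambda\|f\|_H^2 \tag{2}$$

in $H$ for $D_n = ((x_1,y_1),\dots,(x_n,y_n)) \in (\mathcal{X}\times\mathcal{Y})^n$. The estimate $f_{D_n,\lambda}$ uniquely exists for every $\lambda \in (0,\infty)$ and every data set $D_n \in (\mathcal{X}\times\mathcal{Y})^n$ if $t \mapsto L(x,y,t)$ is convex for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$.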
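For the least-squares loss, the minimizer of (2) is available in closed form: by the representer theorem, it can be written as $f = \sum_j \alpha_j k(\cdot,x_j)$, and setting the gradient of (2) with respect to $\alpha$ to zero gives $(K + n\lambda\,\mathrm{Id})\alpha = y$ with the Gram matrix $K = (k(x_i,x_j))_{i,j}$. The following is a minimal sketch of this special case only; the Gaussian kernel, its bandwidth `gamma`, the toy data, and all function names are illustrative assumptions and not part of the article.

```python
import numpy as np

def gaussian_kernel(a, b, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||a_i - b_j||^2); rows of a, b are points."""
    sq = np.sum(a**2, axis=1)[:, None] + np.sum(b**2, axis=1)[None, :] - 2.0 * a @ b.T
    return np.exp(-gamma * np.maximum(sq, 0.0))

def fit_ls_svr(x, y, lam, gamma=1.0):
    """Coefficients alpha of f = sum_j alpha_j k(., x_j) minimizing (2) for L = (y - t)^2."""
    n = x.shape[0]
    K = gaussian_kernel(x, x, gamma)
    return np.linalg.solve(K + n * lam * np.eye(n), y)

def regularized_empirical_risk(alpha, x, y, lam, gamma=1.0):
    """Value of (2) at f = sum_j alpha_j k(., x_j); note ||f||_H^2 = alpha^T K alpha."""
    K = gaussian_kernel(x, x, gamma)
    resid = y - K @ alpha
    return np.mean(resid**2) + lam * alpha @ K @ alpha

# Hypothetical toy data, for illustration only.
rng = np.random.default_rng(0)
x_demo = rng.uniform(0.0, 5.0, size=(40, 1))
y_demo = np.sin(x_demo[:, 0]) + 0.1 * rng.standard_normal(40)
alpha_demo = fit_ls_svr(x_demo, y_demo, lam=0.01)
```

For other convex losses no closed form is available in general, and (2) has to be minimized numerically.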

In the article, the symbol $\rightsquigarrow$ denotes weak convergence of probability measures or random variables.

3 Asymptotic Confidence Intervals 

3.1 Theory 

The derivation of asymptotic confidence sets is based on the result in [10] that, under some assumptions,

$$\sqrt{n}\,(f_{\mathbf{D}_n,\Lambda_n} - f_{P,\lambda_0}) \rightsquigarrow \mathbb{H}_P \quad \text{in } H$$

where $\mathbb{H}_P$ is a mean-zero Gaussian process in $H$ and $\Lambda_n$ is a random regularization parameter (e.g. data-driven). Therefore, the same assumptions as in [10] are needed; they are collected in the following:



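Assumption 3.1 Let $\mathcal{X} \subset \mathbb{R}^d$ be closed and bounded and let $\mathcal{Y} \subset \mathbb{R}$ be closed. Assume that $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ is the restriction of an $r$-times continuously differentiable kernel $\tilde{k} : \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}$ such that $r > d/2$ and $k \neq 0$. Let $H$ be the RKHS of $k$ and let $P$ be a probability measure on $(\mathcal{X}\times\mathcal{Y},\mathfrak{B}(\mathcal{X}\times\mathcal{Y}))$. Let

$$L : \mathcal{X}\times\mathcal{Y}\times\mathbb{R} \to [0,\infty), \qquad (x,y,t) \mapsto L(x,y,t)$$

be a convex, $P$-square-integrable Nemitski loss function of order $p \in [1,\infty)$ such that the partial derivatives

$$L'(x,y,t) := \frac{\partial L}{\partial t}(x,y,t) \quad\text{and}\quad L''(x,y,t) := \frac{\partial^2 L}{\partial t^2}(x,y,t)$$

exist for every $(x,y,t) \in \mathcal{X}\times\mathcal{Y}\times\mathbb{R}$. Assume that the maps

$$(x,y,t) \mapsto L'(x,y,t) \quad\text{and}\quad (x,y,t) \mapsto L''(x,y,t)$$

are continuous. Furthermore, assume that for every $a \in (0,\infty)$, there is a $b'_a \in L_2(P)$ and a constant $b''_a \in [0,\infty)$ such that, for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$,

$$\sup_{t \in [-a,a]} |L'(x,y,t)| \le b'_a(x,y) \quad\text{and}\quad \sup_{t \in [-a,a]} |L''(x,y,t)| \le b''_a. \tag{3}$$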

These assumptions are relatively mild. In particular, the assumptions on $k$ are fulfilled for all of the most common kernels (e.g. Gaussian RBF kernel, polynomial kernel, exponential kernel, linear kernel). Though assuming differentiability of the loss function is an obvious restriction (as it does not cover some of the most popular loss functions such as hinge, epsilon-insensitive, and pinball), this assumption is not based on any unknown entity such as the model distribution $P$. Therefore, a practitioner can a priori meet this requirement by a suitable choice of the loss function (e.g. the least-squares loss for regression, the logistic loss for classification, or smoothed versions of hinge, epsilon-insensitive, and pinball). This is contrary to the assumptions commonly made in order to establish rates of convergence to the infimal risk. Typically, the assumptions used there depend on the unknown $P$ so that they can hardly be checked in applications and are mathematically involved so that they can hardly be communicated to practitioners. In Assumption 3.1, the only assumptions on $P$ are integrability assumptions, which are natural as such assumptions are necessary even for ordinary central limit theorems.

Explicit examples where Assumption 3.1 is fulfilled are given in Section 4.



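Under these assumptions, we have asymptotic normality:

Theorem 3.2 ([10, Theorem 3.1]) Let Assumption 3.1 be fulfilled. Then, for every $\lambda_0 \in (0,\infty)$, there is a tight, Borel-measurable Gaussian process

$$\mathbb{H}_P : \Omega \to H, \qquad \omega \mapsto \mathbb{H}_P(\omega)$$

such that

$$\sqrt{n}\,(f_{\mathbf{D}_n,\Lambda_n} - f_{P,\lambda_0}) \rightsquigarrow \mathbb{H}_P \quad \text{in } H \tag{4}$$

for every Borel-measurable sequence of random regularization parameters $\Lambda_n$ with

$$\sqrt{n}\,(\Lambda_n - \lambda_0) \to 0 \quad \text{in probability}.$$

The Gaussian process $\mathbb{H}_P$ is zero-mean; i.e., $\mathbb{E}\langle f, \mathbb{H}_P\rangle_H = 0$ for every $f \in H$.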



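Recall that a map $\psi : H \to \mathbb{R}^m$ is Hadamard differentiable at some $f_0 \in H$ if and only if there exists a $\psi'_{f_0} = (\psi'_{f_0,1},\dots,\psi'_{f_0,m}) \in H^m$ such that, for every sequence $\varepsilon_\ell \searrow 0$ in $\mathbb{R}$, and for every sequence $h_\ell \to h$ in $H$,

$$\lim_{\ell\to\infty}\left\|\frac{\psi(f_0 + \varepsilon_\ell h_\ell) - \psi(f_0)}{\varepsilon_\ell} - \langle\psi'_{f_0},h\rangle_H\right\|_{\mathbb{R}^m} = 0.$$

The element $\psi'_{f_0} \in H^m$ is called derivative of $\psi$ at $f_0$. For $h \in H$ and $\psi'_{f_0} = (\psi'_{f_0,1},\dots,\psi'_{f_0,m}) \in H^m$, the expression $\langle\psi'_{f_0},h\rangle_H$ denotes the element of $\mathbb{R}^m$ whose components are given by $\langle\psi'_{f_0,j},h\rangle_H$, $j \in \{1,\dots,m\}$.
By a routine application of the functional delta method [27, Theorem 3.9.4], we get the following corollary: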



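Corollary 3.3 Let Assumption 3.1 be fulfilled, let $\lambda_0 \in (0,\infty)$, and let $\psi : H \to \mathbb{R}^m$ be Hadamard-differentiable in $f_{P,\lambda_0}$ with derivative $\psi'_{f_{P,\lambda_0}}$. Then, there is a covariance matrix $\Sigma_P \in \mathbb{R}^{m\times m}$ such that, for every Borel-measurable sequence of random regularization parameters $\Lambda_n$ with

$$\sqrt{n}\,(\Lambda_n - \lambda_0) \to 0 \quad \text{in probability},$$

it holds that

$$\sqrt{n}\,\bigl(\psi(f_{\mathbf{D}_n,\Lambda_n}) - \psi(f_{P,\lambda_0})\bigr) \rightsquigarrow \mathcal{N}_m(0,\Sigma_P).$$

The limit $\mathcal{N}_m(0,\Sigma_P)$ is equal to the distribution of $\langle\psi'_{f_{P,\lambda_0}},\mathbb{H}_P\rangle_H$ where $\mathbb{H}_P$ is given by (4).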



Accordingly, in order to derive asymptotic confidence intervals, the main issue which remains to be solved is to calculate or rather consistently estimate the covariance matrix $\Sigma_P$. In principle, $\Sigma_P$ is completely known if $P$ is known, as can be seen from the proof of Theorem 3.2 given in [10]. This suggests estimating $\Sigma_P$ by a plug-in estimator where $P$ is replaced by the empirical measure $\mathbb{P}_{\mathbf{D}_n}$. However, this is a challenging task because $\mathbb{H}_P$ is given by $\mathbb{H}_P = S'_P(\mathbb{G}_P)$ where $S'_P$ is a (complicated) continuous linear operator and $\mathbb{G}_P$ is a random variable which takes its values in a large function space. Hence, calculating $\Sigma_P = \mathrm{Cov}\bigl(\langle\psi'_{f_{P,\lambda_0}},\mathbb{H}_P\rangle_H\bigr)$ means calculating an integral with respect to a measure on that function space. Fortunately, this can be avoided, as follows from Prop. 3.4. There, $\Sigma_P$ is specified in a way which is more accessible to a plug-in estimator. The consistency of the resulting plug-in estimator is given in Theorem 3.6. Note that $\Sigma_P$ can degenerate to 0 in Corollary 3.3. In order to derive asymptotic confidence sets, degeneracy has to be excluded by additional assumptions (Assumption 3.8 below).



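Proposition 3.4 Let Assumption 3.1 be fulfilled, let $\lambda_0 \in (0,\infty)$, and let $\psi : H \to \mathbb{R}^m$ be Hadamard-differentiable in $f_{P,\lambda_0}$ with derivative $\psi'_{f_{P,\lambda_0}}$. Define

$$g_{P,\lambda_0} : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}^m, \qquad (x,y) \mapsto -L'\bigl(x,y,f_{P,\lambda_0}(x)\bigr)\,\bigl\langle\psi'_{f_{P,\lambda_0}}, K_P^{-1}(\Phi(x))\bigr\rangle_H$$

where $K_P$ denotes the continuous linear operator defined in (19). Then, the covariance matrix $\Sigma_P$ in Corollary 3.3 is equal to

$$\mathrm{Cov}\bigl(g_{P,\lambda_0}(X_1,Y_1)\bigr). \tag{5}$$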



It follows from Prop. 3.4 that $\Sigma_P$ could be estimated by the standard covariance estimator for the $\mathbb{R}^m$-valued i.i.d. random variables

$$g_{P,\lambda_0}(X_1,Y_1),\dots,g_{P,\lambda_0}(X_n,Y_n)$$

if $P$ was known. However, as $P$ is unknown, we have to replace $P$ by the empirical measure and $\psi'_{f_{P,\lambda_0}}$ by an estimator $\psi'_{\mathbf{D}_n,\Lambda_n}$ of $\psi'_{f_{P,\lambda_0}}$. Then, we may estimate $\Sigma_P$ on the basis of the non-i.i.d. random variables

$$g_{\mathbf{D}_n,\Lambda_n}(X_1,Y_1),\dots,g_{\mathbf{D}_n,\Lambda_n}(X_n,Y_n)$$

where

$$g_{\mathbf{D}_n,\Lambda_n}(x,y) = -L'\bigl(x,y,f_{\mathbf{D}_n,\Lambda_n}(x)\bigr)\,\bigl\langle\psi'_{\mathbf{D}_n,\Lambda_n}, K^{-1}_{\mathbf{D}_n,\Lambda_n}(\Phi(x))\bigr\rangle_H \tag{6}$$

and $K_{\mathbf{D}_n(\omega),\Lambda_n(\omega)} : H \to H$ is the continuous linear operator given by

$$K_{\mathbf{D}_n,\Lambda_n}(f) = 2\Lambda_n f + \frac{1}{n}\sum_{i=1}^n L''\bigl(X_i,Y_i,f_{\mathbf{D}_n,\Lambda_n}(X_i)\bigr)\,f(X_i)\,\Phi(X_i) \tag{7}$$

for every $f \in H$. The following theorem states that the resulting plug-in covariance estimator is strongly consistent. It is also shown that the estimator is measurable. This is not obvious as the proof of Theorem 3.2 is based on the theory of empirical processes and the map $\mathbf{D}_n \mapsto \mathbb{P}_{\mathbf{D}_n}$ (which maps a set of data to its empirical measure as an element of a certain function space) is typically not Borel-measurable; see e.g. [27, § 1.1].

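Assumption 3.5 Let $\psi : H \to \mathbb{R}^m$ be Hadamard-differentiable at $f_{P,\lambda_0}$ with derivative $\psi'_{f_{P,\lambda_0}}$ and let $\psi'_{\mathbf{D}_n,\Lambda_n}$ be an estimator of $\psi'_{f_{P,\lambda_0}}$ which is strongly consistent, i.e.,

$$\bigl\|\psi'_{\mathbf{D}_n,\Lambda_n} - \psi'_{f_{P,\lambda_0}}\bigr\|_{H^m} \xrightarrow[n\to\infty]{} 0 \quad \text{almost surely}. \tag{8}$$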



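Theorem 3.6 Let Assumption 3.1 and Assumption 3.5 be fulfilled. Fix $\lambda_0 \in (0,\infty)$ and let $\Sigma_P \in \mathbb{R}^{m\times m}$ be the covariance matrix in Corollary 3.3. Then, for every Borel-measurable sequence of random regularization parameters $\Lambda_n$ with

$$\sqrt{n}\,(\Lambda_n - \lambda_0) \to 0 \quad \text{almost surely},$$

the estimator

$$\hat{\Sigma}_n(\mathbf{D}_n,\Lambda_n) := \frac{1}{n}\sum_{i=1}^n \tilde{g}_{\mathbf{D}_n,\Lambda_n}(X_i,Y_i)\,\tilde{g}_{\mathbf{D}_n,\Lambda_n}(X_i,Y_i)^{\mathsf T}$$

with

$$\tilde{g}_{\mathbf{D}_n,\Lambda_n}(X_i,Y_i) := g_{\mathbf{D}_n,\Lambda_n}(X_i,Y_i) - \frac{1}{n}\sum_{j=1}^n g_{\mathbf{D}_n,\Lambda_n}(X_j,Y_j) \qquad \forall\, i \in \{1,\dots,n\}$$

is measurable and strongly consistent, i.e.,

$$\hat{\Sigma}_n(\mathbf{D}_n,\Lambda_n) \xrightarrow[n\to\infty]{} \Sigma_P \quad \text{almost surely}.$$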
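Once the values $g_{\mathbf{D}_n,\Lambda_n}(X_i,Y_i)$ are available, the estimator in Theorem 3.6 is the ordinary empirical covariance matrix of these values with normalization $1/n$. A minimal sketch follows; the function name and the array layout (one row per observation) are assumptions made for illustration.

```python
import numpy as np

def plug_in_covariance(g_values):
    """Empirical covariance (normalized by 1/n) of the rows of g_values.

    g_values : (n, m) array whose i-th row holds g_{D_n,Lambda_n}(X_i, Y_i).
    """
    g = np.asarray(g_values, dtype=float)
    g_centered = g - g.mean(axis=0, keepdims=True)   # subtract the sample mean
    return g_centered.T @ g_centered / g.shape[0]
```

The result is a symmetric positive semidefinite $m \times m$ matrix; it coincides with `np.cov(g_values, rowvar=False, bias=True)`.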

The following remark specifies a natural candidate for an estimator of $\psi'_{f_{P,\lambda_0}}$; the proof is given in the appendix.

Remark 3.7 If $\psi$ is Hadamard-differentiable at every $f \in H$ with derivative $\psi'_f$ and if $f \mapsto \psi'_f$ is continuous, then Assumption 3.5 is fulfilled for the estimator

$$\psi'_{\mathbf{D}_n,\Lambda_n} := \psi'_{f_{\mathbf{D}_n,\Lambda_n}}$$

provided that $\Lambda_n$ converges to $\lambda_0$ almost surely.

The calculation of the estimator $\hat{\Sigma}_n(\mathbf{D}_n,\Lambda_n)$ for a given data set is an issue of its own because it requires solving the $n$ equations

$$K_{\mathbf{D}_n,\Lambda_n}(f_i) = \Phi(X_i), \qquad i \in \{1,\dots,n\},$$

in the typically infinite dimensional function space $H$. As we will see in Subsection 3.2 below, this problem can satisfactorily be solved (Prop. 3.10). In fact, these equations can be solved jointly, essentially by calculating the Moore-Penrose pseudoinverse of an $n \times n$-matrix once only.

In order to derive asymptotic confidence intervals based on Corollary 3.3, it is desirable that the covariance matrix $\Sigma_P$ has full rank. Lemma 7.2 in the Appendix yields that this can be achieved by the following two weak conditions:



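Assumption 3.8 Assume that, for $P_{\mathcal{X}}(dx)$-a.e. $x \in \mathcal{X}$,

$$\exists\, y_1,y_2 \in \mathrm{supp}\bigl(P(dy|x)\bigr) \ \text{ s.t. } \ L'\bigl(x,y_1,f_{P,\lambda_0}(x)\bigr) \neq L'\bigl(x,y_2,f_{P,\lambda_0}(x)\bigr). \tag{9}$$

For every $j \in \{1,\dots,m\}$, let $\psi'_{f_{P,\lambda_0},j} \in H$ denote the $j$-th component of $\psi'_{f_{P,\lambda_0}}$ and assume that

$$\psi'_{f_{P,\lambda_0},1},\dots,\psi'_{f_{P,\lambda_0},m} \ \text{ are linearly independent on } \mathrm{supp}(P_{\mathcal{X}}). \tag{10}$$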



Due to continuity, Assumption (10) can be reformulated to the following condition:

$$a^{\mathsf T}\psi'_{f_{P,\lambda_0}} = 0 \ \ P_{\mathcal{X}}\text{-a.s. for some } a \in \mathbb{R}^m \quad\Longrightarrow\quad a = 0. \tag{11}$$

As we will see in the examples in the applications section, Assumption 3.8 indeed provides weak and simple conditions. E.g., in case of the least-squares loss or the logistic loss, Assumption (9) is equivalent to assuming that $P(dy|x)$ is not a Dirac measure.

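From the above results and assumptions, it follows that

$$\sqrt{n}\,\hat{\Sigma}_n(\mathbf{D}_n,\Lambda_n)^{-1/2}\bigl(\psi(f_{\mathbf{D}_n,\Lambda_n}) - \psi(f_{P,\lambda_0})\bigr) \rightsquigarrow \mathcal{N}_m(0,\mathrm{Id}_{m\times m})$$

so that we get elliptical confidence sets which are asymptotically correct: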



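Theorem 3.9 Let $\lambda_0 \in (0,\infty)$ and let Assumption 3.1, Assumption 3.5, and Assumption 3.8 be fulfilled. Let $\Lambda_n$ be a sequence of Borel-measurable random regularization parameters with

$$\sqrt{n}\,(\Lambda_n - \lambda_0) \to 0 \quad \text{almost surely}.$$

Fix any $\alpha \in (0,1)$, let $\chi^2_{m,\alpha}$ be the $(1-\alpha)$-quantile of the chi-squared distribution with $m$ degrees of freedom, and

$$C_{n,\alpha}(\mathbf{D}_n,\Lambda_n) := \Bigl\{ w \in \mathbb{R}^m \ \Big|\ \bigl\|\hat{\Sigma}_n(\mathbf{D}_n,\Lambda_n)^{-1/2}\bigl(w - \psi(f_{\mathbf{D}_n,\Lambda_n})\bigr)\bigr\|^2_{\mathbb{R}^m} \le \frac{\chi^2_{m,\alpha}}{n} \Bigr\}.$$

Then,

$$Q\bigl(\psi(f_{P,\lambda_0}) \in C_{n,\alpha}(\mathbf{D}_n,\Lambda_n)\bigr) \xrightarrow[n\to\infty]{} 1-\alpha.$$

Note that the confidence set $C_{n,\alpha}(\mathbf{D}_n,\Lambda_n)$ is an ellipsoid in $\mathbb{R}^m$ which is centered at $\psi(f_{\mathbf{D}_n,\Lambda_n})$ and whose principal axes are given by

$$\sqrt{\frac{\gamma_1\chi^2_{m,\alpha}}{n}}\cdot v_1,\ \dots,\ \sqrt{\frac{\gamma_m\chi^2_{m,\alpha}}{n}}\cdot v_m$$

where $\gamma_1,\dots,\gamma_m$ are the eigenvalues and $v_1,\dots,v_m$ are corresponding orthonormal eigenvectors of the matrix $\hat{\Sigma}_n(\mathbf{D}_n,\Lambda_n)$.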
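Given the point estimate, the estimated covariance matrix, and a chi-squared quantile, membership in the elliptical confidence set and its principal axes can be computed with a few lines of linear algebra. The sketch below assumes the ellipsoid is described by $n\,(w-\hat\psi)^{\mathsf T}\hat\Sigma^{-1}(w-\hat\psi) \le \chi^2_{m,\alpha}$, with $\hat\psi$ the estimated value of the functional and $\hat\Sigma$ a positive definite covariance estimate. The quantile is passed in as a number; all names and the toy values are illustrative assumptions.

```python
import numpy as np

def in_confidence_set(w, psi_hat, sigma_hat, n, chi2_quantile):
    """True if n * (w - psi_hat)^T sigma_hat^{-1} (w - psi_hat) <= chi2_quantile."""
    diff = np.asarray(w, dtype=float) - np.asarray(psi_hat, dtype=float)
    stat = n * diff @ np.linalg.solve(sigma_hat, diff)
    return bool(stat <= chi2_quantile)

def principal_axes(sigma_hat, n, chi2_quantile):
    """Rows are the semi-axes sqrt(gamma_j * chi2_quantile / n) * v_j of the ellipsoid."""
    gammas, vecs = np.linalg.eigh(sigma_hat)      # eigenvectors are the columns of vecs
    lengths = np.sqrt(gammas * chi2_quantile / n)
    return lengths[:, None] * vecs.T

# Toy values for m = 2. For two degrees of freedom the chi-squared distribution
# is exponential with mean 2, so the (1 - a)-quantile equals -2 * log(a).
sigma_demo = np.array([[2.0, 0.5], [0.5, 1.0]])
psi_demo = np.array([1.0, -1.0])
q_demo = -2.0 * np.log(0.05)
axes_demo = principal_axes(sigma_demo, 100, q_demo)
```

Each row of `axes_demo`, added to `psi_demo`, is a point on the boundary of the ellipsoid. For general $m$, the quantile has to be taken from a table or a statistics library.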




3.2 Computation of Asymptotic Confidence Sets 

The calculation of the estimator $\hat{\Sigma}_n(\mathbf{D}_n,\Lambda_n)$ for a given data set is burdened by the fact that we have to solve each of the following $n$ equations

$$K_{\mathbf{D}_n,\Lambda_n}(f_i) = \Phi(X_i), \qquad i \in \{1,\dots,n\},$$

in the typically infinite dimensional function space $H$. In particular for a large sample size $n$, this seems to be problematic. However, Prop. 3.10 below yields that the problem can essentially be reduced to the calculation of a single Moore-Penrose pseudoinverse of an $n \times n$-matrix after the following preparation: Let $D_n = ((x_1,y_1),\dots,(x_n,y_n)) \in (\mathcal{X}\times\mathcal{Y})^n$. Then, there is always a maximal subset $\{\Phi(x_{i_1}),\dots,\Phi(x_{i_r})\}$ of $\{\Phi(x_1),\dots,\Phi(x_n)\}$ such that $\Phi(x_{i_1}),\dots,\Phi(x_{i_r})$ are linearly independent, i.e. $\{\Phi(x_{i_1}),\dots,\Phi(x_{i_r})\}$ is a basis of the vector space spanned by $\Phi(x_1),\dots,\Phi(x_n)$. Accordingly, for every $i \in \{1,\dots,n\}$, there are $\beta_{1i},\dots,\beta_{ri} \in \mathbb{R}$ such that

$$\Phi(x_i) = \sum_{j=1}^r \beta_{ji}\,\Phi(x_{i_j}). \tag{12}$$

Define

$$B_{D_n} := \begin{pmatrix} \beta_{11} & \dots & \beta_{1n} \\ \vdots & & \vdots \\ \beta_{r1} & \dots & \beta_{rn} \end{pmatrix} \in \mathbb{R}^{r\times n}. \tag{13}$$

E.g., in case of a Gaussian RBF kernel, vectors $\Phi(x_{i_1}),\dots,\Phi(x_{i_r})$ are linearly independent if and only if all $x_{i_1},\dots,x_{i_r}$ differ; see e.g. [17, Theorem 2.18]. Hence, in this case, finding $B_{D_n}$ only means identifying all ties in the covariates and, if there are no such ties, $B_{D_n}$ is just the $n \times n$-identity matrix.
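For a Gaussian RBF kernel, the matrix in (13) can therefore be built by grouping tied covariates: each column has a single entry 1 in the row of the representative that the observation is tied with. The following is a minimal sketch of this special case; the function name and the choice of the first occurrence as representative are assumptions made for illustration.

```python
import numpy as np

def tie_matrix(x):
    """Return (idx, B) for covariates x (one row per observation).

    idx : indices i_1 < ... < i_r of the first occurrence of each distinct covariate
    B   : (r, n) matrix with B[j, i] = 1 if x_i equals x_{idx[j]}, else 0
    """
    x = np.asarray(x, dtype=float)
    x = x.reshape(x.shape[0], -1)
    row_of = {}                      # maps a covariate (as tuple) to its row in B
    idx, rows = [], []
    for i, point in enumerate(x):
        key = tuple(point)
        if key not in row_of:
            row_of[key] = len(idx)
            idx.append(i)
        rows.append(row_of[key])
    B = np.zeros((len(idx), x.shape[0]))
    B[rows, np.arange(x.shape[0])] = 1.0
    return np.array(idx), B

idx_demo, B_demo = tie_matrix([0.5, 1.0, 0.5, 2.0, 1.0])
```

Here `idx_demo` is `[0, 1, 3]` and `B_demo` has shape `(3, 5)`. Exact floating-point equality is used to detect ties, which is adequate only if tied covariates are stored as identical numbers.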



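Proposition 3.10 Let Assumption 3.1 be fulfilled. Fix any set of data $D_n = ((x_1,y_1),\dots,(x_n,y_n)) \in (\mathcal{X}\times\mathcal{Y})^n$ and any $\lambda \in (0,\infty)$. Define $B_{D_n}$ according to (12) and (13). Let $L''_{D_n,\lambda} \in \mathbb{R}^{n\times n}$ denote the diagonal matrix with diagonal entries

$$L''\bigl(x_1,y_1,f_{D_n,\lambda}(x_1)\bigr),\dots,L''\bigl(x_n,y_n,f_{D_n,\lambda}(x_n)\bigr),$$

define the $n \times n$-matrix

$$A_{D_n,\lambda} = 2\lambda\cdot\mathrm{Id}_{n\times n} + \frac{1}{n}\cdot L''_{D_n,\lambda}\begin{pmatrix} k(x_1,x_1) & \dots & k(x_1,x_n) \\ \vdots & & \vdots \\ k(x_n,x_1) & \dots & k(x_n,x_n) \end{pmatrix}$$

and let $(B_{D_n}A_{D_n,\lambda})^-$ be the Moore-Penrose pseudoinverse of $B_{D_n}A_{D_n,\lambda}$. Then, for every $x \in \mathcal{X}$ and $y \in \mathcal{Y}$,

$$K^{-1}_{D_n,\lambda}(\Phi(x)) = \frac{1}{2\lambda}\Phi(x) + \sum_{i=1}^n \alpha_i(x)\,\Phi(x_i)$$

and

$$g_{D_n,\lambda}(x,y) = -L'\bigl(x,y,f_{D_n,\lambda}(x)\bigr)\cdot\Bigl(\frac{1}{2\lambda}\psi'_{f_{D_n,\lambda}}(x) + \sum_{i=1}^n \alpha_i(x)\,\psi'_{f_{D_n,\lambda}}(x_i)\Bigr)$$

where

$$\begin{pmatrix}\alpha_1(x) \\ \vdots \\ \alpha_n(x)\end{pmatrix} = -\frac{1}{2\lambda n}\,(B_{D_n}A_{D_n,\lambda})^-\,B_{D_n}\begin{pmatrix} L''\bigl(x_1,y_1,f_{D_n,\lambda}(x_1)\bigr)\,k(x_1,x) \\ \vdots \\ L''\bigl(x_n,y_n,f_{D_n,\lambda}(x_n)\bigr)\,k(x_n,x) \end{pmatrix}.$$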

By use of this proposition, the calculation of the estimator $\hat{\Sigma}_n(D_n,\lambda)$ is unproblematic. According to its definition, it is enough to calculate the values $g_{D_n,\lambda}(x_i,y_i)$, $i \in \{1,\dots,n\}$, and, in order to do this, the matrices $B_{D_n}$ and $A_{D_n,\lambda}$ have to be defined and $(B_{D_n}A_{D_n,\lambda})^-$ has to be calculated once only. Then, all values $g_{D_n,\lambda}(x_i,y_i)$, $i \in \{1,\dots,n\}$, can simultaneously be calculated by matrix calculus. After that, it only remains to calculate the inverse of the matrix $\hat{\Sigma}_n(D_n,\lambda)$ in order to calculate the elliptical confidence set. In order to obtain the principal axes of the ellipsoid, one only has to calculate an (orthonormal) eigendecomposition of $\hat{\Sigma}_n(D_n,\lambda)$ instead.
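The matrix calculus can be sketched as follows. Inserting the ansatz $f = \frac{1}{2\lambda}\Phi(x) + \sum_i \alpha_i\Phi(x_i)$ into the operator (7) and comparing coefficients shows that $K_{D_n,\lambda}(f) = \Phi(x)$ holds as soon as $B_{D_n}A_{D_n,\lambda}\,\alpha = -\frac{1}{2\lambda n}B_{D_n}\ell(x)$, where $\ell(x)$ has entries $L''(x_i,y_i,f_{D_n,\lambda}(x_i))\,k(x_i,x)$; the code solves this system with a pseudoinverse. This is a hedged sketch derived from (7), (12), and (13), not the article's implementation; the function names, argument layout, and toy data are assumptions.

```python
import numpy as np

def alpha_coefficients(K, k_x, l2, lam, B=None):
    """Coefficients alpha(x) with K^{-1}(Phi(x)) = Phi(x)/(2*lam) + sum_i alpha_i Phi(x_i).

    K   : (n, n) Gram matrix k(x_i, x_j)
    k_x : (n,) vector k(x_i, x)
    l2  : (n,) second derivatives L''(x_i, y_i, f(x_i))
    B   : (r, n) matrix from (12)-(13); the identity is used if B is None (no ties)
    """
    n = K.shape[0]
    B = np.eye(n) if B is None else B
    A = 2.0 * lam * np.eye(n) + (l2[:, None] * K) / n   # 2*lam*Id + (1/n) diag(l2) K
    rhs = B @ (l2 * k_x)
    return -np.linalg.pinv(B @ A) @ rhs / (2.0 * lam * n)

def residual_coefficients(alpha, K, k_x, l2, lam):
    """Coefficients c_j of Phi(x_j) in K(f) - Phi(x) for the ansatz f above."""
    n = K.shape[0]
    f_at_data = k_x / (2.0 * lam) + K @ alpha            # values f(x_j)
    return 2.0 * lam * alpha + l2 * f_at_data / n

def g_value(l1_x, psi_x, psi_data, alpha, lam):
    """g(x, y) = -L'(x, y, f(x)) * (psi'(x)/(2*lam) + sum_i alpha_i psi'(x_i)).

    psi_x : (m,) vector psi'_f(x);  psi_data : (n, m) matrix with rows psi'_f(x_i).
    """
    return -l1_x * (psi_x / (2.0 * lam) + psi_data.T @ alpha)

# Toy check with one tie (second and third covariate) and L'' = 2 (least squares).
x_data = np.array([0.0, 1.0, 1.0, 2.5])
K_demo = np.exp(-(x_data[:, None] - x_data[None, :]) ** 2)
k_new = np.exp(-(x_data - 0.7) ** 2)
B_ties = np.array([[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]])
l2_demo = np.full(4, 2.0)
alpha_ties = alpha_coefficients(K_demo, k_new, l2_demo, lam=0.1, B=B_ties)
c_ties = residual_coefficients(alpha_ties, K_demo, k_new, l2_demo, lam=0.1)
```

In the toy check, `B_ties @ c_ties` vanishes up to rounding, which means that the residual $\sum_j c_j\Phi(x_j)$ is the zero element of $H$. Since the pseudoinverse does not depend on $x$, it can be computed once and reused for all evaluation points.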

4 Applications 



In Subsection 3.1, a general scheme is developed for deriving asymptotic confidence sets for values $\psi(f_{P,\lambda_0})$ of functionals $\psi : H \to \mathbb{R}^m$. This general scheme is exemplified in a few possible applications from which it can also be seen that the assumptions made in Subsection 3.1 are moderate and, equally important, not mathematically involved so that they are comprehensible to practitioners.

The input and the output space. Let $\mathcal{X} \subset \mathbb{R}^d$ be closed and bounded and let $\mathcal{Y} \subset \mathbb{R}$ be closed. That is, the setting covers regression with $\mathcal{Y} = \mathbb{R}$ and classification with $\mathcal{Y} = \{-1,+1\}$ as well.

The kernel $k$. Let $\tilde{k} : \mathbb{R}^d\times\mathbb{R}^d \to \mathbb{R}$ be a kernel which is $r$-times continuously differentiable where $r > d/2$. Let $k : \mathcal{X}\times\mathcal{X} \to \mathbb{R}$ be the restriction of $\tilde{k}$ on $\mathcal{X}\times\mathcal{X}$. Let $k \neq 0$. That is, each of the most common kernels can be chosen: a Gaussian RBF kernel, a polynomial kernel, the linear kernel, the exponential kernel, or sums and products of such kernels.

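The loss function $L$. By way of example, we consider the following three settings:

(A) Regression with the least-squares loss: Let

$$L(x,y,t) = (y-t)^2 \qquad \forall\,(x,y,t) \in \mathcal{X}\times\mathcal{Y}\times\mathbb{R}$$

and assume that $\mathbb{E}\,Y^4 < \infty$.

(B) Regression with the logistic loss: Fix a constant $\sigma > 0$ and define

$$L(x,y,t) = -\sigma\cdot\log\frac{4\exp\bigl(\frac{y-t}{\sigma}\bigr)}{\bigl(1+\exp\bigl(\frac{y-t}{\sigma}\bigr)\bigr)^2} \qquad \forall\,(x,y,t) \in \mathcal{X}\times\mathcal{Y}\times\mathbb{R}$$

and assume that $\mathbb{E}\,Y^2 < \infty$.

(C) Classification: Let $\mathcal{Y} = \{-1,+1\}$ and choose the least-squares loss

$$L(x,y,t) = (1-yt)^2 \qquad \forall\,(x,y,t) \in \mathcal{X}\times\mathcal{Y}\times\mathbb{R}$$

or the logistic loss

$$L(x,y,t) = \log\bigl(1+\exp(-yt)\bigr) \qquad \forall\,(x,y,t) \in \mathcal{X}\times\mathcal{Y}\times\mathbb{R}.$$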
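The plug-in estimator needs the first and second derivatives of the loss with respect to its third argument. The sketch below writes them out for the logistic losses, assuming the regression version has the form $L(x,y,t) = -\sigma\log\bigl(4e^{r}/(1+e^{r})^2\bigr)$ with $r = (y-t)/\sigma$ and the classification version the form $\log(1+\exp(-yt))$; differentiating gives $L' = -\tanh(r/2)$ and $L'' = (1-\tanh^2(r/2))/(2\sigma)$ in the regression case. The function names are illustrative assumptions.

```python
import numpy as np

def logistic_reg_loss(y, t, sigma=1.0):
    """-sigma * log(4 e^r / (1 + e^r)^2) with r = (y - t) / sigma, computed stably."""
    r = (y - t) / sigma
    return -sigma * (np.log(4.0) + r - 2.0 * np.logaddexp(0.0, r))

def logistic_reg_d1(y, t, sigma=1.0):
    """First derivative with respect to t: -tanh(r / 2)."""
    return -np.tanh((y - t) / (2.0 * sigma))

def logistic_reg_d2(y, t, sigma=1.0):
    """Second derivative with respect to t: (1 - tanh(r / 2)^2) / (2 * sigma)."""
    return (1.0 - np.tanh((y - t) / (2.0 * sigma)) ** 2) / (2.0 * sigma)

def logistic_clf_loss(y, t):
    """log(1 + exp(-y t)) for labels y in {-1, +1}."""
    return np.logaddexp(0.0, -y * t)

def logistic_clf_d1(y, t):
    """First derivative with respect to t: -y / (1 + exp(y t))."""
    return -y / (1.0 + np.exp(y * t))

def logistic_clf_d2(y, t):
    """Second derivative with respect to t: y^2 * s * (1 - s), s = sigmoid(y t)."""
    s = 1.0 / (1.0 + np.exp(-y * t))
    return y ** 2 * s * (1.0 - s)
```

For the least-squares loss $(y-t)^2$, the corresponding derivatives are simply $-2(y-t)$ and $2$. In all these cases the second derivative is bounded, in line with condition (3).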



In each of these settings, Assumption 3.1 is fulfilled. Furthermore, (9) in Assumption 3.8 can be rewritten as

$$\mathrm{Var}(Y|x) \neq 0 \qquad \text{for } P_{\mathcal{X}}(dx)\text{-a.e. } x \in \mathcal{X}. \tag{14}$$

If $\mathrm{Var}(Y|x) = 0$ for some $x \in \mathcal{X}$, then $Y$ is deterministically fixed by $X = x$. Of course, $\mathrm{Var}(Y|x) = 0$ for some $x \in \mathcal{X}$ can happen at most in case of heteroscedastic (or even more complicated) error terms. In addition to (9), the only remaining assumption is Assumption (10), which we have to take care of when choosing a functional $\psi$.

The regularization parameter $\Lambda_n$. The regularization parameter can be randomly chosen, e.g. by use of any data-driven method (cross validation etc.). The only requirement is to make sure that $\sqrt{n}\,(\Lambda_n - \lambda_0) \to 0$ almost surely for $n \to \infty$. A simple way to fulfill this condition for any data-driven method is to choose a (possibly large) constant $c \in (0,\infty)$ and to modify the method in such a way that it picks a value from $[\lambda_0,\ \lambda_0 + c/\sqrt{n\ln(n)}\,]$. Note that it is indeed possible to use the same data for choosing the regularization parameter as for building the final estimate, just as usually done by practitioners, e.g., when applying cross validation.

The functional $\psi$. With these choices and assumptions, the asymptotic confidence set (Theorem 3.9) is valid for every functional $\psi : H \to \mathbb{R}^m$ which is Hadamard-differentiable at $f_{P,\lambda_0}$ and fulfills (10). In the following, some concrete examples for $\psi$ are listed or even worked out in detail. In most cases, $\psi$ is continuous and linear so that the derivative $\psi'_{f_{P,\lambda_0}}$ is exactly known as it does not depend on the unknown $f_{P,\lambda_0}$. If $\psi'_{f_{P,\lambda_0}}$ is exactly known, then Assumption (10) can be checked in real applications by use of the following "test": Define the $m \times n$-matrix

$$\Psi := \bigl(\psi'_{f_{P,\lambda_0}}(x_1),\dots,\psi'_{f_{P,\lambda_0}}(x_n)\bigr)$$

where $x_1,\dots,x_n$ are the observed values of the input variables. If Assumption (10) is violated, then the probability that $\Psi$ has rank $m$ (i.e. full rank for $n \ge m$) is equal to 0. (This follows from continuity of $\psi'_{f_{P,\lambda_0}}$ and the fact that $P_{\mathcal{X}}(\mathrm{supp}(P_{\mathcal{X}})) = 1$.) That is, if the observed $\Psi$ has full rank, one can assume that (10) is fulfilled.
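The rank "test" is a one-line computation once the derivative has been evaluated at the observed covariates. The sketch below illustrates it for pointwise evaluation with a Gaussian kernel, where the entries of the matrix are kernel values between evaluation points and observations; the function name, the kernel, and the toy data are assumptions made for illustration.

```python
import numpy as np

def condition_10_rank_check(psi_prime_at_data):
    """True if the (m, n) matrix with columns psi'_f(x_i) has full row rank m."""
    M = np.atleast_2d(np.asarray(psi_prime_at_data, dtype=float))
    return bool(np.linalg.matrix_rank(M) == M.shape[0])

# Toy illustration: three distinct evaluation points versus a duplicated point.
rng = np.random.default_rng(1)
x_obs = rng.uniform(0.0, 5.0, size=50)
points_ok = np.array([1.0, 2.0, 3.0])
points_dup = np.array([1.0, 1.0])
psi_ok = np.exp(-(points_ok[:, None] - x_obs[None, :]) ** 2)     # shape (3, 50)
psi_dup = np.exp(-(points_dup[:, None] - x_obs[None, :]) ** 2)   # shape (2, 50)
```

Here `condition_10_rank_check(psi_ok)` holds, while the duplicated evaluation point makes the rows of `psi_dup` identical and the check fail. Note that `matrix_rank` uses a numerical tolerance, so nearly dependent rows may also be reported as rank-deficient.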



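Example 1: Pointwise confidence intervals
Fix some $x_1,\dots,x_m \in \mathcal{X}$ and define

$$\psi(f) = \bigl(f(x_1),\dots,f(x_m)\bigr)^{\mathsf T}.$$

Since $\psi : H \to \mathbb{R}^m$ is continuous and linear, $\psi$ is continuously Hadamard-differentiable. The derivative is given by

$$\psi'_f = \bigl(\Phi(x_1),\dots,\Phi(x_m)\bigr)^{\mathsf T} \in H^m, \qquad f \in H.$$

Condition (10) can be checked as described above. Since

$$\psi'_f(x) = \bigl(k(x,x_1),\dots,k(x,x_m)\bigr)^{\mathsf T} \qquad \forall\, x \in \mathcal{X},\ f \in H,$$

it follows from Prop. 3.10 that

$$g_{\mathbf{D}_n,\Lambda_n}(x,y) = -L'\bigl(x,y,f_{\mathbf{D}_n,\Lambda_n}(x)\bigr)\cdot\begin{pmatrix} \frac{1}{2\Lambda_n}k(x,x_1) + \sum_{i=1}^n \alpha_i(x)\,k(X_i,x_1) \\ \vdots \\ \frac{1}{2\Lambda_n}k(x,x_m) + \sum_{i=1}^n \alpha_i(x)\,k(X_i,x_m) \end{pmatrix}$$

where the $\alpha_i(x)$, $i \in \{1,\dots,n\}$, are calculated according to Prop. 3.10. Fix any $\alpha \in (0,1)$. Then, Theorem 3.9 says that

$$Q\Bigl(\bigl(f_{P,\lambda_0}(x_1),\dots,f_{P,\lambda_0}(x_m)\bigr) \in C_{n,\alpha}(\mathbf{D}_n,\Lambda_n)\Bigr) \to 1-\alpha$$

where $C_{n,\alpha}(\mathbf{D}_n,\Lambda_n)$ is the elliptical confidence set as defined in Theorem 3.9. □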



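Due to the reproducing property [21, Def. 4.18], Example 1 is a special case of the following example.

Example 2: Confidence intervals for inner products
Fix some $h_1,\dots,h_m \in H$ which are linearly independent on the support of $P_{\mathcal{X}}$ and define

$$\psi(f) = \bigl(\langle f,h_1\rangle_H,\dots,\langle f,h_m\rangle_H\bigr)^{\mathsf T}, \qquad f \in H.$$

Since $\psi : H \to \mathbb{R}^m$ is continuous and linear, $\psi$ is continuously Fréchet differentiable and the derivative is given by

$$\psi'_f = (h_1,\dots,h_m)^{\mathsf T} \in H^m, \qquad f \in H,$$

and condition (10) is fulfilled. It follows from Prop. 3.10 that

$$g_{\mathbf{D}_n,\Lambda_n}(x,y) = -L'\bigl(x,y,f_{\mathbf{D}_n,\Lambda_n}(x)\bigr)\cdot\begin{pmatrix} \frac{1}{2\Lambda_n}h_1(x) + \sum_{i=1}^n \alpha_i(x)\,h_1(X_i) \\ \vdots \\ \frac{1}{2\Lambda_n}h_m(x) + \sum_{i=1}^n \alpha_i(x)\,h_m(X_i) \end{pmatrix}$$

where $\alpha_i(x)$, $i \in \{1,\dots,n\}$, are calculated according to Prop. 3.10. Fix any $\alpha \in (0,1)$. Then, Theorem 3.9 says that

$$Q\Bigl(\bigl(\langle f_{P,\lambda_0},h_1\rangle_H,\dots,\langle f_{P,\lambda_0},h_m\rangle_H\bigr) \in C_{n,\alpha}(\mathbf{D}_n,\Lambda_n)\Bigr) \to 1-\alpha$$

where $C_{n,\alpha}(\mathbf{D}_n,\Lambda_n)$ is the elliptical confidence set as defined in Theorem 3.9. □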



Example 3: Confidence set for the gradient
Fix any $x_0$ in the interior of $\mathcal{X}$ and, for every $f \in H$, let

$$\psi(f) = \partial f(x_0) \in \mathbb{R}^d$$

be the gradient vector of $f$ in $x_0$. According to [21, p. 130ff], the partial derivative of $f$ in $x_0$ with respect to the $j$-th coordinate of $x$ is given by $\partial_j f(x_0) = \langle f,\partial_j\Phi(x_0)\rangle_H$. Hence, this is again a special case of Example 2 and it follows that

$$\psi'_f(x) = \frac{\partial}{\partial\tilde{x}}\,k(x,\tilde{x})\Big|_{\tilde{x}=x_0} \qquad \forall\, x \in \mathcal{X},\ f \in H.$$

Again, Assumption (10) can be checked as described above. □



Example 4: Confidence set for integrals
Fix any Borel set $B \subset \mathcal{X}$ and, for every $f \in H$, define

$$\psi(f) = \int_B f\,dP_{\mathcal{X}} \in \mathbb{R}.$$

This is again a special case of Example 2, the derivative is given by

$$\psi'_f(x) = \int_B k(x,\tilde{x})\,P_{\mathcal{X}}(d\tilde{x}) \qquad \forall\, x \in \mathcal{X},\ f \in H$$

and Assumption (10) can be checked as described above. □



Example 5: Confidence interval for the H-norm and for the L2-norm
The map

$$\psi : H \to \mathbb{R}, \qquad f \mapsto \|f\|_H^2$$

is continuously Hadamard differentiable with derivative $\psi'_f = 2f$ at $f$; see e.g. [7, Example 5.1.6(c)]. Condition (10) is fulfilled if $f_{P,\lambda_0}$ is not $P_{\mathcal{X}}$-almost surely equal to 0. Hence, it is possible to construct a confidence interval for $\|f_{P,\lambda_0}\|_H^2$ according to Theorem 3.9 and, therefore, also for $\|f_{P,\lambda_0}\|_H$ by taking square roots.

Similarly, for any $B \subset \mathcal{X}$, the map

$$\psi : H \to \mathbb{R}, \qquad f \mapsto \int_B f(x)^2\,dx$$

is continuously Hadamard-differentiable and the derivative at any $f \in H$ is equal to $\psi'_f = \int_B 2f(x)\Phi(x)\,dx$ (this follows from [21, Lemma 2.21] where $L(x,y,t) = t^2$). Again, Condition (10) is fulfilled if $f_{P,\lambda_0}$ is not $P_{\mathcal{X}}$-almost surely equal to 0 on $B$. This can be shown by considering the RKHS which consists of the restrictions of the elements $f \in H$ on $\mathrm{supp}(P_{\mathcal{X}})$. □

Similarly to Example 4, the map $f \mapsto \|f - f_{P,\lambda_0}\|_H^2$ is continuously differentiable so that one might be tempted to apply $\psi(f) = \|f - f_{P,\lambda_0}\|_H^2$ in Theorem 3.9 in order to obtain a confidence band for the whole function $f_{P,\lambda_0}$ and not just for a finite number of points as in Example 1. However, this is not possible because then the derivative is given by $\psi'_f = 2(f - f_{P,\lambda_0})$ so that $\psi'_{f_{P,\lambda_0}} = 0$, which violates (10). The mathematical reason behind this is that, according to the continuous mapping theorem, $\|\sqrt{n}(f_{D_n,\Lambda_n} - f_{P,\lambda_0})\|_H^2$ weakly converges to $\|H_P\|_H^2$. That is, $\|f_{D_n,\Lambda_n} - f_{P,\lambda_0}\|_H^2$ converges with rate $n$ while the confidence sets obtained from Theorem 3.9 are based on the rate $\sqrt{n}$. By estimating quantiles of the distribution of $\|H_P\|_H^2$, it would be possible to derive confidence bands for the whole function $f_{P,\lambda_0}$. However, estimating quantiles of the distribution of $\|H_P\|_H^2$ is a matter of its own and cannot be done by use of the results of Subsection 3.1, among other things because $\|H_P\|_H^2$ is not normally distributed (as $\|H_P\|_H^2 \ge 0$).

5 Simulations 

5.1 Confidence sets for function values 

The model. The situation

$$Y_i = f_0(X_i) + \varepsilon_i, \qquad i \in \{1,\dots,n\},$$

is considered with the regression function

$$f_0(x) = \log(x+2) + 0.7\sin(3x) + 0.7\cos(2x). \tag{15}$$

The errors $\varepsilon_i$ are drawn i.i.d. from the standard normal distribution and the covariates $X_i$ are drawn i.i.d. from the uniform distribution on $[0,5]$. The simulation consists of 5000 data sets with sample sizes $n$ equal to 250, 500, and 1000. The confidence sets apply to $f_{P,\lambda_0}$ with $\lambda_0 = 0.00001$ but the $L_1$-distance between $f_{P,\lambda_0}$ and the actual regression function $f_0$ is approximately equal to 0.026 and the maximal pointwise distance is approximately equal to 0.091 so that the difference between $f_{P,\lambda_0}$ and $f_0$ can be almost ignored for practical purposes here. Three kinds of confidence sets are considered: a univariate one for $f_{P,\lambda_0}(x_0)$ with $x_0 = 3$, a multivariate one for the four values $f_{P,\lambda_0}(x)$, $x \in \{1,2,3,4\}$, and a multivariate one for the seven values $f_{P,\lambda_0}(x)$, $x \in \{1,1.5,2,2.5,3,3.5,4\}$. The nominal (asymptotic) confidence level is 0.95.
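The data-generating model described above can be sketched in a few lines. This is only an illustrative sketch: the function names `f0` and `simulate_dataset` and the seed are assumptions made here and do not come from the article.

```python
import math
import random


def f0(x):
    """Regression function (15): log(x + 2) + 0.7 sin(3x) + 0.7 cos(2x)."""
    return math.log(x + 2) + 0.7 * math.sin(3 * x) + 0.7 * math.cos(2 * x)


def simulate_dataset(n, rng):
    """Draw n pairs (x_i, y_i) with x_i ~ U[0, 5] and y_i = f0(x_i) + N(0, 1) noise."""
    xs = [rng.uniform(0.0, 5.0) for _ in range(n)]
    ys = [f0(x) + rng.gauss(0.0, 1.0) for x in xs]
    return xs, ys


rng = random.Random(0)  # hypothetical seed, chosen only to make the sketch reproducible
xs, ys = simulate_dataset(250, rng)
```

Repeating `simulate_dataset` 5000 times for each of n = 250, 500, 1000 would give data sets of the kind used in the simulation.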

Estimation. The regularized kernel method was applied with the Gaussian RBF kernel $k(x,x') = \exp(-\gamma\|x - x'\|^2_{\mathbb{R}^d})$ and the logistic loss function with parameter a = 0.5. Following [4J and [IT, p. 9], the hyperparameter $\gamma$ of the kernel was fixed to 0.5 which is about the inverse of the median of the values $\|x_i - x_j\|^2_{\mathbb{R}}$. The regularization parameter was chosen within the values

0.00001, 0.00005, 0.0001, 0.0005, 0.001, 0.005, 0.01

in a data-driven way by a fivefold cross-validation.
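The rule for the kernel hyperparameter (inverse of the median squared pairwise distance) can be illustrated as follows. This is a minimal sketch under assumptions: the helper names are hypothetical, and taking the median over all pairs with i < j is one plausible reading of the text, not necessarily the authors' exact computation.

```python
import math
import statistics


def median_heuristic_gamma(points):
    """Inverse of the median of squared Euclidean distances over all pairs i < j."""
    sq_dists = []
    for i in range(len(points)):
        for j in range(i + 1, len(points)):
            sq_dists.append(sum((a - b) ** 2 for a, b in zip(points[i], points[j])))
    return 1.0 / statistics.median(sq_dists)


def gaussian_rbf(x, x_prime, gamma):
    """Gaussian RBF kernel exp(-gamma * ||x - x'||^2) for points given as tuples."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, x_prime)))


# Tiny example: squared distances are 1, 9, 4, so the median is 4 and gamma is 0.25.
gamma_example = median_heuristic_gamma([(0.0,), (1.0,), (3.0,)])
```

For covariates uniform on [0, 5], this heuristic gives a value a little below 0.5, in line with the fixed value 0.5 used in the simulation.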

Performance results. Table 1 lists the simulated coverage probabilities and, in case of the univariate confidence interval, the average length (± standard deviation) of the intervals obtained by 5000 data sets. Figure 1 shows the boxplots for the estimates of the asymptotic variance $\Sigma_P$ of $\sqrt{n}\big(f_{D_n,\Lambda_n}(x_0) - f_{P,\lambda_0}(x_0)\big)$ for the different sample sizes. In addition, Figure 2 shows the plot of the true function $f_{P,\lambda_0}$ and the pointwise univariate confidence interval for every $x \in [0,5]$ obtained for four different data sets with $n = 500$. This is only for illustration purposes and must not be confused with a simultaneous confidence band; the band around the true function is not a simultaneous confidence band.

    n      1-dim.                       4-dim.            7-dim.
           Cov. prob. (%)   length      Cov. prob. (%)    Cov. prob. (%)
    250    92.7             0.61±0.09   91.6              79.4
    500    94.0             0.44±0.04   93.1              91.1
    1000   94.7             0.32±0.02   94.5              93.0

Table 1: Simulated coverage probability of the confidence sets obtained by 5000 data sets in Subsection 5.1
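In the univariate case, the elliptical confidence set reduces to an interval centered at the estimate with half-width $z_{1-\alpha/2}\sqrt{\Sigma_n/n}$. The following sketch shows this computation; the function name and the numbers in the example call are hypothetical and are not simulation output.

```python
import math
from statistics import NormalDist


def asymptotic_interval(estimate, variance_estimate, n, level=0.95):
    """Interval estimate +/- z * sqrt(variance_estimate / n) with z the normal quantile."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)
    half_width = z * math.sqrt(variance_estimate / n)
    return estimate - half_width, estimate + half_width


# Hypothetical inputs: point estimate 1.0, estimated asymptotic variance 4.0, n = 100.
lower, upper = asymptotic_interval(1.0, 4.0, 100)
```

Because the half-width scales with $1/\sqrt{n}$, doubling $n$ shrinks the interval by a factor of about $\sqrt{2} \approx 1.41$; the average lengths reported for n = 250, 500, 1000 (0.61, 0.44, 0.32) are roughly consistent with this.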



5.2 Confidence set for the gradient 

The model. Two situations are considered, the univariate one

$$Y_i = f_0(X_i) + \varepsilon_i, \qquad i \in \{1,\dots,n\},$$

exactly as in Subsection 5.1 and the multivariate one

$$Y_i = f_0(X_{i,1}) + \sin(1.5\,X_{i,2}) + \varepsilon_i, \qquad i \in \{1,\dots,n\},$$

where $f_0$ is as in (15). The errors $\varepsilon_i$ are drawn i.i.d. from the standard normal distribution. In the univariate case, the covariates $X_i$ are drawn i.i.d. from the uniform distribution on $[0,5]$ and, in the multivariate case, the covariates $X_{i,1}$ are also drawn i.i.d. from the uniform distribution on $[0,5]$ and the covariates $X_{i,2}$ are drawn i.i.d. from the uniform distribution on $[-1,1]$. In both cases, we consider confidence sets for $\psi(f_{P,\lambda_0}) = \partial f_{P,\lambda_0}(x_0)$ with $\lambda_0 = 0.00001$ where, in the univariate case, $x_0 = 3$ and, in the multivariate case, $x_0 = (3,0)$. Accordingly, the confidence set is an interval in the univariate case and an ellipse in the multivariate case. The nominal (asymptotic) confidence level is 0.95.

Estimation. The regularized kernel method was applied with the Gaussian RBF kernel and the logistic loss function with parameter a = 0.5. Following [3] and [14, p. 9], the hyperparameter $\gamma$ of the kernel was fixed to 1/3 which is about the inverse of the median of the values $\|x_i - x_j\|^2_{\mathbb{R}^2}$. The regularization parameter was chosen as in Subsection 5.1.

[Figure 1, panels n=250, n=500, n=1000]

Figure 1: Boxplots for the estimation of the asymptotic variance $\Sigma_P$ of $\sqrt{n}\big(f_{D_n,\Lambda_n}(x_0) - f_{P,\lambda_0}(x_0)\big)$ for the different sample sizes $n$ in Subsection 5.1

Figure 2: Estimated pointwise 0.95-confidence intervals (grey area) for four different data sets with sample size $n = 500$ and the function $f_{P,\lambda_0}$ (solid line) in Subsection 5.1

    n      1-dim.                       2-dim.
           Cov. prob. (%)   length      Cov. prob. (%)
    250    84.0             0.27±0.23   74.5
    500    90.0             0.17±0.13   83.4
    1000   91.5             0.11±0.09   91.3

Table 2: Simulated coverage probability of the confidence sets obtained by 5000 data sets in Subsection 5.2
Performance results. Table 2 lists the simulated coverage probabilities and, in case of the univariate confidence interval, the average length (± standard deviation) of the intervals obtained by 5000 data sets. For $n = 1000$ in the multivariate case, Figure 3 shows the estimates $\psi(f_{D_n,\Lambda_n}) = \partial f_{D_n,\Lambda_n}(x_0)$ obtained in the 5000 runs (gray points), the true value $\psi(f_{P,\lambda_0})$ (as cross ×), and the ellipse (dashed boundary)

$$\Big\{ w \in \mathbb{R}^m \;\Big|\; \big\|\Sigma_P^{-1/2}\big(w - \psi(f_{P,\lambda_0})\big)\big\|^2_{\mathbb{R}^m} \le \frac{\chi^2_{m,\alpha}}{n} \Big\}$$

in each plot. Asymptotically, this ellipse contains the estimate $\psi(f_{D_n,\Lambda_n})$ with probability 0.95. In addition, each plot shows the estimate $\psi(f_{D_n,\Lambda_n})$ (as black point) and illustrates the estimated covariance matrix $\Sigma_n(D_n,\Lambda_n)$ by showing the ellipse (solid boundary)

$$\Big\{ w \in \mathbb{R}^m \;\Big|\; \big\|\Sigma_n(D_n,\Lambda_n)^{-1/2}\big(w - \psi(f_{P,\lambda_0})\big)\big\|^2_{\mathbb{R}^m} \le \frac{\chi^2_{m,\alpha}}{n} \Big\}$$

given by the estimate $\Sigma_n(D_n,\Lambda_n)$ in one of the first four runs of the simulation.
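Membership in an ellipse of this form can be checked without computing a matrix square root, since $\|\Sigma^{-1/2} d\|^2 = d^{\mathsf T}\Sigma^{-1} d$ for a symmetric positive definite $\Sigma$. The sketch below covers the two-dimensional case only, where the chi-squared quantile has the closed form $-2\log\alpha$; the function name and the example inputs are hypothetical.

```python
import math
import numpy as np


def in_confidence_ellipse(w, center, cov, n, alpha=0.05):
    """Check ||cov^{-1/2} (w - center)||^2 <= chi2_{2,alpha} / n for a 2x2 covariance."""
    diff = np.asarray(w, dtype=float) - np.asarray(center, dtype=float)
    # For symmetric positive definite cov: ||cov^{-1/2} d||^2 = d^T cov^{-1} d.
    quad_form = float(diff @ np.linalg.solve(np.asarray(cov, dtype=float), diff))
    # (1 - alpha)-quantile of the chi-squared distribution with 2 degrees of freedom.
    chi2_quantile = -2.0 * math.log(alpha)
    return quad_form <= chi2_quantile / n


# Hypothetical inputs: identity covariance, n = 1, threshold approx. 5.99.
inside = in_confidence_ellipse((2.0, 1.0), (0.0, 0.0), np.eye(2), n=1)
outside = in_confidence_ellipse((2.0, 2.0), (0.0, 0.0), np.eye(2), n=1)
```

For dimensions other than two, the quantile would have to come from a statistics library instead of the closed form used here.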

6 Conclusions 

Regularized kernel methods constitute an important class of standard learning algorithms in machine learning. As theoretical investigations concerning asymptotic properties have mainly focused on rates of convergence, the lack of (asymptotic) results on statistical inference is a serious limitation for their use in mathematical statistics. Therefore, the article derives asymptotically correct confidence sets for $\psi(f_{P,\lambda_0})$ where $f_{P,\lambda_0}$ denotes the minimizer of the regularized risk in the reproducing kernel Hilbert space $H$ and $\psi : H \to \mathbb{R}^m$ is any Hadamard-differentiable functional. That is, the confidence sets do not apply to the minimizer $f^*$ of the unregularized risk, which would be the quantity of primary interest, but to the minimizer of the regularized risk. On the one hand, this is due to the so-called no-free-lunch theorem: obtaining confidence sets for $f^*$ would require a number of technical assumptions which can hardly be made plausible in practical applications. Without such assumptions, $f^*$ does not need to exist; if it exists, it does not have to be unique; and the rate of convergence depends on unknown properties. Technical assumptions can completely be avoided in this article; all assumptions are simple and can easily be communicated to practitioners.

[Figure 3, four panels; horizontal axis x_1]

Figure 3: For $n = 1000$ in the multivariate case, each plot shows the estimates $\psi(f_{D_n,\Lambda_n})$ obtained in the 5000 runs (gray points), the true value $\psi(f_{P,\lambda_0})$ (as cross ×), and the ellipse (dashed boundary) which asymptotically contains the estimate $\psi(f_{D_n,\Lambda_n})$ with probability 0.95. Each of the four plots shows the estimate $\psi(f_{D_n,\Lambda_n})$ (black point) and the ellipse where the true covariance $\Sigma_P$ is replaced by the estimate $\Sigma_n(D_n,\Lambda_n)$ (solid boundary) in one of the first four runs of the simulation in Subsection 5.2



On the other hand, it is exemplified in a simulated example (Subsection 5.1) that the difference between $f^*$ and $f_{P,\lambda_0}$ is negligible for practical purposes even for moderately small $\lambda_0 > 0$.

The derivation of the confidence sets is done by use of asymptotic normality 
of a large class of regularized kernel methods and by the derivation of a 
strongly consistent estimator for the unknown covariance matrix of the lim- 
iting normal distribution. To this end, the following non-trivial problems 
had to be solved satisfactorily: (i) the derivation of a manageable formula for 
the covariance matrix, which is accessible for a plug-in estimator, (ii) strong 
consistency of the plug-in estimator, (iii) the exclusion of degeneracy of the 
covariance matrix by simple and weak conditions, and (iv) the derivation of
an algorithm for the calculation of the estimator which is computationally 
tractable also for moderately large sample sizes. 

Applications include (multivariate) pointwise confidence sets for values of $f_{P,\lambda_0}$ and confidence sets for gradients, integrals, and norms. However, the derivation of simultaneous confidence bands is a matter of further research. It follows from [10, Theorem 3.1] that

$$\sqrt{n}\,\big\|f_{D_n,\Lambda_n} - f_{P,\lambda_0}\big\|_\infty \;\rightsquigarrow\; \|H_P\|_\infty .$$

Hence, simultaneous confidence bands could be obtained if it is possible to derive a consistent estimator for quantiles of $\|H_P\|_\infty$.

7 Appendix: Proofs 



Assumption 3.1 is valid in the whole appendix. Since the results of Section 3 are based on results and proofs in [10], we have to recall the quite technical setting from [10, § A.1] at first.

In order to shorten notation, define

$$L_f : \mathcal{X}\times\mathcal{Y} \to \mathbb{R}, \qquad (x,y) \mapsto L_f(x,y) = L(x,y,f(x))$$

for every function $f : \mathcal{X} \to \mathbb{R}$. Accordingly, define

$$L'_f(x,y) = L'(x,y,f(x)) \quad\text{and}\quad L''_f(x,y) = L''(x,y,f(x))$$

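for every $(x,y) \in \mathcal{X}\times\mathcal{Y}$. As $L$ is a $P$-square-integrable Nemitski loss function of order $p \in [1,\infty)$, there is a $b \in L_2(P)$ and a constant $c \in (0,\infty)$ such that

$$|L(x,y,t)| \le b(x,y) + c\,|t|^p \qquad \forall (x,y,t) \in \mathcal{X}\times\mathcal{Y}\times\mathbb{R}.$$

Let

$$\mathcal{G}_1 := \big\{ g : \mathcal{X}\times\mathcal{Y} \to \mathbb{R} \;\big|\; \exists z \in \mathbb{R}^{d+1} \text{ such that } g = I_{(-\infty,z]} \big\} \tag{16}$$

be the set of all indicator functions $I_{(-\infty,z]}$. Define $c_0 := \sqrt{\lambda_0^{-1}\int b\,dP} + 1$,

$$\mathcal{G}_2 := \big\{ g : \mathcal{X}\times\mathcal{Y} \to \mathbb{R} \;\big|\; \exists f_0 \in H,\ \exists f \in H \text{ such that } \|f_0\|_H \le c_0,\ \|f\|_H \le 1 \text{ and } g = L'_{f_0} f \big\}$$

and

$$\mathcal{G} := \mathcal{G}_1 \cup \mathcal{G}_2 \cup \{b\}.$$

Let $\ell_\infty(\mathcal{G})$ be the set of all bounded functions $G : \mathcal{G} \to \mathbb{R}$ with norm $\|G\|_\infty = \sup_{g\in\mathcal{G}} |G(g)|$. Define

$$B_S := \Big\{ G : \mathcal{G} \to \mathbb{R} \;\Big|\; \exists \mu \ne 0 \text{ a finite measure on } \mathcal{X}\times\mathcal{Y} \text{ such that } G(g) = \int g\,d\mu \ \forall g \in \mathcal{G},\ b \in L_2(\mu),\ b'_a \in L_2(\mu)\ \forall a \in (0,\infty) \Big\}$$

and $B_0 := \operatorname{cl}(\operatorname{lin}(B_S))$, the closed linear span of $B_S$ in $\ell_\infty(\mathcal{G})$. That is, $B_S$ is a subset of $\ell_\infty(\mathcal{G})$ whose elements correspond to finite measures. The assumptions on $L$ and $P$ imply that $\mathcal{G} \to \mathbb{R}$, $g \mapsto \int g\,dP$ is a well-defined element of $B_S$. Most often, we identify an element $G \in B_S$ with its corresponding finite measure $\mu$. That is, we write $\mu(g) = G(g) = \int g\,d\mu$ for every $g \in \mathcal{G}$.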

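Let $\mu \in B_S$. Then,

$$S(\mu) := f_{\mu,\lambda_0} = \arg\inf_{f\in H} \Big( \int L(x,y,f(x))\,\mu(d(x,y)) + \lambda_0 \|f\|_H^2 \Big).$$

This defines a map $S : B_S \to H$. As the multiplication by a strictly positive real number does not change the "arg inf", we have

$$f_{\mu,\lambda} = f_{\frac{\lambda_0}{\lambda}\mu,\,\lambda_0} = S\big(\tfrac{\lambda_0}{\lambda}\mu\big) \qquad \forall \mu \in B_S,\ \lambda \in (0,\infty). \tag{17}$$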
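The rescaling identity (17) can be checked numerically in a finite-dimensional analogue. The sketch below assumes a least-squares loss and a linear function class (ridge regression), which is a simplification chosen here for illustration and not the general setting of the article; all names and numbers are hypothetical.

```python
import numpy as np


def ridge_minimizer(X, y, weight, lam):
    """Closed-form minimizer of weight * mean((y - X w)^2) + lam * ||w||^2."""
    n, d = X.shape
    return np.linalg.solve(weight * X.T @ X / n + lam * np.eye(d), weight * X.T @ y / n)


rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = rng.normal(size=50)
lam, lam0 = 0.2, 0.01

# Left-hand side of (17): regularization parameter lam, unscaled empirical measure.
w_direct = ridge_minimizer(X, y, 1.0, lam)
# Right-hand side of (17): regularization parameter lam0, measure rescaled by lam0 / lam.
w_rescaled = ridge_minimizer(X, y, lam0 / lam, lam0)
```

Both calls return the same coefficient vector up to floating-point error, because the two objectives differ only by the positive factor `lam0 / lam`.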



Let $\mu \in B_S$ such that $\mu(b) < P(b) + \lambda_0$. Then, it is shown in [10, Theorem A.8] that $S$ is Hadamard differentiable in $\mu$ tangentially to $B_0$. The derivative in $\mu$ is given by

$$S'_\mu(\nu) = -K_\mu^{-1}\Big( \int L'_{f_{\mu,\lambda_0}}(x,y)\,\Phi(x)\,\nu(d(x,y)) \Big) \qquad \forall \nu \in \operatorname{lin}(B_S) \tag{18}$$

and

$$K_\mu : H \to H, \qquad f \mapsto 2\lambda_0 f + \int L''_{f_{\mu,\lambda_0}}(x,y)\,f(x)\,\Phi(x)\,\mu(d(x,y)). \tag{19}$$

Note that the integrals with respect to the finite signed measure $\nu$ in (18) and the measure $\mu$ in (19) are Bochner integrals as the integrands are $H$-valued functions. According to [10, Lemma A.5], $K_\mu$ is an invertible continuous linear operator and, according to [10, Theorem A.8], the derivative $S'_\mu : B_0 \to H$ is a continuous linear operator. The following relation between $K_\mu$ and the random $K_{D_n,\Lambda_n}$ defined in (11) is valid:

$$K_{D_n,\Lambda_n} = \frac{\Lambda_n}{\lambda_0}\, K_{\frac{\lambda_0}{\Lambda_n} P_{D_n}}. \tag{20}$$

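If we identify the empirical measure $P_{D_n(\omega)}$ and $P$ with their corresponding elements in $\ell_\infty(\mathcal{G})$, it is shown in [10, Lemma A.9] that

$$\sqrt{n}\,(P_{D_n} - P) \;\rightsquigarrow\; G_P \quad \text{in } \ell_\infty(\mathcal{G}) \tag{21}$$

where $G_P : \Omega \to \ell_\infty(\mathcal{G})$ is a tight Borel-measurable Gaussian process. Then, it is shown in [10, Proof of Theorem 3.1] that

$$\sqrt{n}\,(f_{D_n,\Lambda_n} - f_{P,\lambda_0}) \;\rightsquigarrow\; H_P = S'_P(G_P) \quad \text{in } H. \tag{22}$$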

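Proof of Corollary 3.3: According to the delta-method [27, Theorem 3.9.4], it follows from (22) and Hadamard-differentiability of $\psi$ in $f_{P,\lambda_0}$ that

$$\sqrt{n}\,\big(\psi(f_{D_n,\Lambda_n}) - \psi(f_{P,\lambda_0})\big) \;\rightsquigarrow\; \big\langle \psi'_{f_{P,\lambda_0}}, H_P \big\rangle_H .$$

Since $f \mapsto \langle \psi'_{f_{P,\lambda_0}}, f\rangle_H$ is a continuous linear operator and $H_P$ is a zero-mean Gaussian process, it follows that the limit distribution is a multivariate normal distribution with mean zero, i.e., the distribution of $\langle \psi'_{f_{P,\lambda_0}}, H_P\rangle_H$ is equal to $\mathcal{N}_m(0,\Sigma_P)$ for some covariance matrix $\Sigma_P \in \mathbb{R}^{m\times m}$; see e.g. [27, §3.9.2]. □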
Proof of Prop. 3.4: First, it is a direct consequence of the definition of the continuous linear operator $K_P$ that $K_P$ is self-adjoint and, accordingly, the inverse $K_P^{-1}$ is again self-adjoint; see [9, Lemma VI.2.10]. Define fj :=
Kp^{^'j^^ ■) £ H and note that 121., (5.4)] implies 



^'f^Jmnh ea. 



(23) 



Since Kp is self-adjoint, it follows for every G € \ui{Bs) with corresponding 
signed measure // that 

where (*) follows from interchangeability of Bochner integrals with contin- 
uous linear operators; see e.g. [71 Theorem 3.10.16 and Remark 3.10.17]. 
Next, it follows from continuity of Sp that 



{^'f...,,vS'p{G))H = -mH-G{L)^Jf,\\-^'f,) 



(24) 



is valid even for every G £ Bq where Bq denotes the closed linear span of 
Bs in iooiG)- Since Gp takes its values in Bq, it follows from Hp = Sp{Gp) 
now that 

{^'fp,.o'P^p)H = -\\mH-Gp{L'j^Jf,\\jlfj) VjG{l,...,m}. (25) 



According to (21) and (23) 



y Gp{L'j:^^^J\fm\\H fm) I 

where Sp is the covariance matrix of 



AA„(0,Sp) 



L'. (x,y)||/i||^Vi(^),.-.,^/,,(^,nil/mlli,Vm(x)) ; 



see, e.g., [271 P- 81f]. Let C denote the diagonal matrix with diagonal entries 

(26) 



||/i||hi . . . , ||/m||H- Then, it follows from (25) that 



Sp = Covf(V^}^^^,Hp>HJ = CtpC. 
Since, according to the reproducing property and self-adjointness of Kp ,



= {^'fp,.,,vKp\^{x)))H ViG{l,...,m}, 

it follows that -L'f^^ WfjWllfj = WfjW'u'aPMj where gp^x^j denotes the 
j-th component oi gp^x^y Hence, Sp = CoY(C~^gp^\^{X,Y)^ so that (26) 
implies Sp = Cov(5p,Ao(^>^))- ° 



Lemma 7.1 Under the Assumptions of Theorem 3.6, the covariance esti- 
mator S„(D„, A„) is measurable with respect to A and B®"* . 

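Proof of Lemma 7.1: It has to be shown that $g_{D_n,\Lambda_n}(X_i,Y_i)$ is measurable for every $i \in \{1,\dots,n\}$.

First, note that $\omega \mapsto f_{D_n(\omega),\Lambda_n(\omega)}$ is measurable because: for every fixed $\lambda > 0$, the map $D \mapsto f_{D,\lambda}$ is continuous on $(\mathcal{X}\times\mathcal{Y})^n$ according to [21, Lemma 5.13] and, for every fixed $D \in (\mathcal{X}\times\mathcal{Y})^n$, the map $\lambda \mapsto f_{D,\lambda}$ is continuous on $(0,\infty)$ according to [21, Theorem 5.17]; hence, $(D,\lambda) \mapsto f_{D,\lambda}$ is a Carathéodory function and, therefore, measurable, see e.g. [7, Theorem 2.5.22].

Secondly, we show measurability of $K^{-1}_{D_n,\Lambda_n}(\Phi(X_i))$. To this end, define

$$A_{D,\lambda,g} : H \to H, \qquad f \mapsto \frac{1}{n}\sum_{i=1}^n L''(x_i,y_i,g(x_i))\,f(x_i)\,\Phi(x_i)$$

and

$$K_{D,\lambda,g} : H \to H, \qquad f \mapsto 2\lambda f + A_{D,\lambda,g}(f)$$

for every $D = ((x_1,y_1),\dots,(x_n,y_n)) \in (\mathcal{X}\times\mathcal{Y})^n$, $\lambda \in (0,\infty)$, and $g \in H$. That is, $K_{D_n,\Lambda_n} = K_{D_n,\Lambda_n,f_{D_n,\Lambda_n}}$. The assumptions imply that

$$(\mathcal{X}\times\mathcal{Y})^n \times (0,\infty) \times H \to H, \qquad (D,\lambda,g) \mapsto K_{D,\lambda,g}(f) \quad \text{is continuous} \tag{27}$$

for every $f \in H$. Note that

$$\langle f, A_{D,\lambda,g}(f)\rangle_H = \frac{1}{n}\sum_{i=1}^n L''(x_i,y_i,g(x_i))\,f(x_i)\,\langle f,\Phi(x_i)\rangle_H = \frac{1}{n}\sum_{i=1}^n L''(x_i,y_i,g(x_i))\,\big(f(x_i)\big)^2 \ge 0$$

because convexity of $t \mapsto L(x,y,t)$ implies $L''(x,y,t) \ge 0$. Hence,

$$\|K_{D,\lambda,g}(f)\|_H^2 = 4\lambda^2\|f\|_H^2 + 4\lambda\,\langle f, A_{D,\lambda,g}(f)\rangle_H + \|A_{D,\lambda,g}(f)\|_H^2 \ge 4\lambda^2\|f\|_H^2$$

for every $f \in H$, and this implies

$$\big\|K^{-1}_{D,\lambda,g}\big\| \le \frac{1}{2\lambda} \qquad \forall (D,\lambda,g) \in (\mathcal{X}\times\mathcal{Y})^n \times (0,\infty) \times H. \tag{28}$$

Let the sequence $(D_\ell,\lambda_\ell,g_\ell)$, $\ell \in \mathbb{N}$, converge to some $(D,\lambda,g) \in (\mathcal{X}\times\mathcal{Y})^n \times (0,\infty) \times H$. Fix any $f \in H$ and denote $h := K^{-1}_{D,\lambda,g}(f)$. Then,

$$\big\|K^{-1}_{D_\ell,\lambda_\ell,g_\ell}(f) - K^{-1}_{D,\lambda,g}(f)\big\|_H = \big\|K^{-1}_{D_\ell,\lambda_\ell,g_\ell}\big(K_{D,\lambda,g}(h)\big) - h\big\|_H = \big\|K^{-1}_{D_\ell,\lambda_\ell,g_\ell}\big(K_{D,\lambda,g}(h) - K_{D_\ell,\lambda_\ell,g_\ell}(h)\big)\big\|_H \overset{(28)}{\le} \frac{1}{2\lambda_\ell}\,\big\|K_{D,\lambda,g}(h) - K_{D_\ell,\lambda_\ell,g_\ell}(h)\big\|_H \xrightarrow[\ell\to\infty]{} 0$$

according to (27). That is, $(D,\lambda,g) \mapsto K^{-1}_{D,\lambda,g}(f)$ is continuous for every fixed $f \in H$. Since $f \mapsto K^{-1}_{D,\lambda,g}(f)$ is continuous for every fixed $(D,\lambda,g)$, the function $((D,\lambda,g),f) \mapsto K^{-1}_{D,\lambda,g}(f)$ is a Carathéodory function and, therefore, measurable. Since $f_{D_n,\Lambda_n}$ is measurable as shown above and $K_{D_n,\Lambda_n} = K_{D_n,\Lambda_n,f_{D_n,\Lambda_n}}$, it follows that $K^{-1}_{D_n,\Lambda_n}(\Phi(X_i))$ is measurable.

Finally, measurability of the estimator $\psi'_{D_n,\Lambda_n}$, measurability of $f_{D_n,\Lambda_n}$, and measurability of $K^{-1}_{D_n,\Lambda_n}(\Phi(X_i))$ imply measurability of

$$g_{D_n,\Lambda_n}(X_i,Y_i) = -L'\big(X_i,Y_i,f_{D_n,\Lambda_n}(X_i)\big)\,\big\langle \psi'_{D_n,\Lambda_n},\, K^{-1}_{D_n,\Lambda_n}(\Phi(X_i))\big\rangle_H .$$

□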
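The operator-norm bound (28) has a simple finite-dimensional counterpart: if $A$ is symmetric positive semidefinite, then every eigenvalue of $K = 2\lambda I + A$ is at least $2\lambda$, so the spectral norm of $K^{-1}$ is at most $1/(2\lambda)$. The following sketch checks this numerically; the random matrix is a hypothetical stand-in for the operator $A_{D,\lambda,g}$ and is not derived from any loss function.

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 0.05

M = rng.normal(size=(6, 6))
A = M @ M.T  # symmetric positive semidefinite stand-in for A_{D,lambda,g}
K = 2.0 * lam * np.eye(6) + A

inv_norm = np.linalg.norm(np.linalg.inv(K), ord=2)  # spectral norm of K^{-1}
bound = 1.0 / (2.0 * lam)
```

The bound depends only on `lam`, not on `A`, which is what makes the continuity argument in the proof uniform in $(D, g)$.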

Proof of Theorem 13. 6t For every j £ { 1 , . . . , m} , let ip'r £ H denote 

the j'-th component of ip'^ and, accordingly, let "fADn A„ j ^ -^' ffDn.Anj) 
and gp,Xo,j denote the j-th component of V'd a > 5d„,a„, and gp^Xo respec- 
tively. Define Zi = {Xi,Yi) for every i S M. Measurability of 5d„,a„j(-^j) is 



shown in the proof of Lemma 7.1 Define a := ||/p,AoIIoo + 1 S [liCo) and 
c := max,- \\ih'f 11 „ • \\Kp \\ ■ \\k\\oo where \\Kp 11 denotes the operator 

■^ 1 1 jP,Aq,j 1 1 -o II -Til II II II r^ \\ 

norm of the continuous linear operator Kp . Then, the definition of gp^XoJ 
and ([3]) imply 

\9P,XoA^)\ < c-fe'a(^) yz£Xxy. (29) 
Hence, $g_{P,\lambda_0,j}$ is $P$-square integrable. Fix any $j,\ell \in \{1,\dots,m\}$. We have to
show 

1 " 

-Y.9n^^r.AZ^) ^^ ^9pmAZi)] (30) 

1 " 






According to [iQl Lemma A. 9], ^ is a P-Donsker class and, therefore, a P- 
Ghvenko-Cantehi class almost sure; see [271 P- 82]. Hence, sup^gg |Pd„(9) — 
P{9)\ — ^ almost surely and, therefore, there is a measurable set Oq G 
A such that Q(r2o) = 1 and sup^^g |Pd„(w)(5) - -P(5)| — > for every 
a; G r^o; see [27^ §1-9 and Lemma 1.2.3]. Due to the law of large num- 
bers, we can choose Qq € A in such a way that, for every uj £ Qq, in 
addition, ^YA=i9P,\o,j{Zi{^)) and ^Y.'i=i9P,\o,ji^ii^))9P,XoA^ii^)) and 
n Sj"=i ^'ai^ii^)) and ^ l^"=i ^a(-^i('^))^ Converge to their expectations for 
n — >• oo. Furthermore, due to the assumptions on ^^ j^ and A„, the 
set Qq can also be chosen in such a way that, in addition, \\ip'-£f (i_j) a (uj) ~ 
Af \\h — > and A„(a;) — > Aq for every w G J7o- Fix any a; G Qq; define 
Dn := D„(li;) and A„ := A„(ti;) for every n G IN and {xi,yi) := Zi := Zi{uj) 
for every i G M. That is, we have 

lim sup|PdJ9)-^(5)| = 0, lim A„ = Aq , (32) 

lim 1 1 -00 X —Af„. \\u = 0, (33) 

1 " 

iii^;^Zl^^.^oj(^i) = Ep [(7P,Ao,i] , (34) 

1 " 

ii^~X]^^.^oj(^i)ffP,Ao,K^i) = ^p[9pm,j9pmA^ (35) 

hm -J2U^^) = ^Pb'a^ and lim -j^U^^f = ^pK'- (36) 






It is shown in [TUl (46) and (47)] that S : Bs ^ H, fi ^ S{iJ,) = f^^Xo is 
continuous in P and, therefore, 

hm /z,„,A„ ^ hm 5(^Pd„) ^ SiP) = fp,,, . (37) 

n— >oo n— 5-00 ^ ^n / 
In view of (34), it suffices to prove






-^ 



(38) 



4 = 1 



2=1 



in order to prove ( 30 ) . 



First, it is shown in the following that, for the fixed sequence {zn)neK G 
X X y, there is an rij £ IN and a sequence (ej,n)ne]N C [0, c«) such that 
lim„--s>oo £j,n = and, for every n > Uj and for every z = {x,y) £ X x y, 

\gDn,x„,jiz) - gp^Xo,jiz)\ < £j,n + £j,n ■ b'^iz) . (39) 

To this end, note that it is shown in [T0[ (43)] that /i i— ;■ Kt^^ is continuous 



in P and, therefore, it follows from (32) that 
}20l Ao 



K 



-1 



Xr. 



-K\ 



XT-Pon n^oo ' 



-^ Kl 



in operator norm. 



The definitions imply 



|to„,A„,i(^)-5P,Ao,i(2)| < 



(40) 



(41) 



+ 



■\L', 



If 



Due to ([37j), there is an rij e IN such that Wfor^xAoo < ||/p,AoIIoo + 1 = a 
for every n > n^. Hence, the first summand converges to uniformly in 



z£X xy because of ([37]), ||$(x)||i:^ < ||A;||oo Vx G A", ([33]), ^, and 

KV'n A .^-fiTn^ i'^ix)))H\-\L'f (z)-L'f (z)\ < 



li^L 



Dn,X„,j\\H 



\K 



Dji-^An 



sup ||$(x)||h • &a • ||/d„,a„ - fPM 



ieX 



For V* ,■ G ^5 let "0* J ° -^u ^ denote the continuous linear operator h i— )■ 
{ip'^^j,K^^(h))H- Then, the second summand in (41) is bounded via 

^"d ||V'i)„,A„,i ° -^i^iAn - ^/p,Ao ,i ° ^P^W ■ ^^PseA' II^(*)IIh converges to zero 



because of ||$(a;)||H < \\k\\oo Vx G X, (33), and (40). This proves that 
we can choose a null-sequence $(\varepsilon_{j,n})_{n\in\mathbb{N}} \subset [0,\infty)$ such that (39) is fulfilled
for every n > rij and every z G X x y. Accordingly, there is an n^ G M 
and a sequence (e£,n)neiN C [0, oo) such that lim„_>oo £i,n = and, for every 



n > rii and for every z G X x y, assertion (39) with j replaced by £ is 



fulfilled. Then, due to (29), 



\gD„MAz)\ < e^,„ + (c + ££,„)• 61(2) WzGXxy, n>ni. (42) 



Define e„ := m.a.x{e j^n,£i.n} for every n S IN. Then, for every n > 



n, 



■J' 






<@ 1 

~ n 



j=i 



«=i 



^(e„ + e„-6;,(zi)) 



-> 



i=l 



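where convergence to 0 follows from $\lim_{n\to\infty}\varepsilon_n = 0$ and (36). That is, we have proven (30).

In view of (35), it suffices to prove

$$\frac{1}{n}\sum_{i=1}^n g_{D_n,\lambda_n,j}(z_i)\,g_{D_n,\lambda_n,\ell}(z_i) - \frac{1}{n}\sum_{i=1}^n g_{P,\lambda_0,j}(z_i)\,g_{P,\lambda_0,\ell}(z_i) \;\longrightarrow\; 0 \tag{43}$$

in order to prove (31). According to (29), (39), and (42),

$$\begin{aligned}
&\Big| \frac{1}{n}\sum_{i=1}^n g_{D_n,\lambda_n,j}(z_i)\,g_{D_n,\lambda_n,\ell}(z_i) - \frac{1}{n}\sum_{i=1}^n g_{P,\lambda_0,j}(z_i)\,g_{P,\lambda_0,\ell}(z_i) \Big| \\
&\quad\le \frac{1}{n}\sum_{i=1}^n \big|g_{D_n,\lambda_n,j}(z_i) - g_{P,\lambda_0,j}(z_i)\big| \cdot \big|g_{D_n,\lambda_n,\ell}(z_i)\big| + \frac{1}{n}\sum_{i=1}^n \big|g_{P,\lambda_0,j}(z_i)\big| \cdot \big|g_{D_n,\lambda_n,\ell}(z_i) - g_{P,\lambda_0,\ell}(z_i)\big| \\
&\quad\le \frac{1}{n}\sum_{i=1}^n \big(\varepsilon_n + \varepsilon_n\, b'_a(z_i)\big)\big(\varepsilon_n + (c+\varepsilon_n)\, b'_a(z_i)\big) + \frac{1}{n}\sum_{i=1}^n c\, b'_a(z_i)\big(\varepsilon_n + \varepsilon_n\, b'_a(z_i)\big) \\
&\quad= \varepsilon_n^2 + 2(\varepsilon_n c + \varepsilon_n^2)\cdot\frac{1}{n}\sum_{i=1}^n b'_a(z_i) + (2\varepsilon_n c + \varepsilon_n^2)\cdot\frac{1}{n}\sum_{i=1}^n b'_a(z_i)^2
\end{aligned}$$

and the last line converges to 0 as $\lim_{n\to\infty}\varepsilon_n = 0$ and due to (36). □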
Proof of Remark 3.7: According to [10, Lemma A.9], $\mathcal{G}$ is a $P$-Donsker class and, therefore, a $P$-Glivenko-Cantelli class almost surely; see [27, p. 82]. Hence, $P_{D_n}$ converges to $P$ in $B_S$ almost surely. It is shown in [10, (46) and (47)] that $S : B_S \to H$, $\mu \mapsto S(\mu) = f_{\mu,\lambda_0}$ is continuous in $P$ and, therefore,

$$f_{D_n,\Lambda_n} = S\big(\tfrac{\lambda_0}{\Lambda_n} P_{D_n}\big) \;\longrightarrow\; S(P) = f_{P,\lambda_0} .$$

□

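Lemma 7.2 Let the Assumptions 3.1 be fulfilled, let $\lambda_0 \in (0,\infty)$, and let $\psi : H \to \mathbb{R}^m$ be Hadamard-differentiable in $f_{P,\lambda_0}$ with derivative $\psi'_{f_{P,\lambda_0}}$. For every $j \in \{1,\dots,m\}$, let $\psi'_{f_{P,\lambda_0},j} \in H$ denote the $j$-th component of $\psi'_{f_{P,\lambda_0}}$. Let $\Sigma_P \in \mathbb{R}^{m\times m}$ be the covariance matrix in Corollary 3.3. Assume that, for $P_X(dx)$-a.e. $x \in \mathcal{X}$, there are $y_1,y_2 \in \operatorname{supp}(P(dy|x))$ such that

$$L'\big(x,y_1,f_{P,\lambda_0}(x)\big) \ne L'\big(x,y_2,f_{P,\lambda_0}(x)\big). \tag{44}$$

Then,

$$\Sigma_P \text{ has full rank} \quad\Longleftrightarrow\quad \nexists\, a \in \mathbb{R}^m\setminus\{0\} \text{ s.th. } a^{\mathsf T}\psi'_{f_{P,\lambda_0}} = 0 \ \ P_X\text{-a.s.} \tag{45}$$

Proof of Lemma 7.2: According to (5), we have

$$\Sigma_P \text{ has full rank} \quad\Longleftrightarrow\quad \nexists\, a \in \mathbb{R}^m\setminus\{0\},\ c \in \mathbb{R} : \ a^{\mathsf T} g_{P,\lambda_0} = c \ \ P\text{-a.s.} \tag{46}$$

It is a direct consequence of the definition of the continuous linear operator $K_P$ that $K_P$ is self-adjoint and, accordingly, $K_P^{-1}$ is again self-adjoint; see [9, Lemma VI.2.10]. Hence, according to the reproducing property, we get

$$g_{P,\lambda_0,j}(x,y) = -L'_{f_{P,\lambda_0}}(x,y)\,\big\langle \psi'_{f_{P,\lambda_0},j},\, K_P^{-1}(\Phi(x))\big\rangle_H = -L'_{f_{P,\lambda_0}}(x,y)\,\big\langle K_P^{-1}(\psi'_{f_{P,\lambda_0},j}),\, \Phi(x)\big\rangle_H = -L'_{f_{P,\lambda_0}}(x,y)\,\big[K_P^{-1}(\psi'_{f_{P,\lambda_0},j})\big](x) \tag{47}$$

and, therefore,

$$a^{\mathsf T} g_{P,\lambda_0} = -L'_{f_{P,\lambda_0}} \cdot K_P^{-1}\big(a^{\mathsf T}\psi'_{f_{P,\lambda_0}}\big) \qquad \forall a \in \mathbb{R}^m. \tag{48}$$

It will be shown below that, for every $f \in H$,

$$f = 0 \ \ P_X\text{-a.s.} \quad\Longleftrightarrow\quad K_P^{-1}(f) = 0 \ \ P_X\text{-a.s.} \tag{49}$$

By use of these preparations and (49), the proof of (45) can be done quickly: First, assume that there is an $a \in \mathbb{R}^m\setminus\{0\}$ such that $a^{\mathsf T}\psi'_{f_{P,\lambda_0}} = 0$ $P_X$-a.s. Then, it follows from (49), (48), and (46) that $\Sigma_P$ does not have full rank. That is, we have proven "$\Rightarrow$" in (45). Next, in order to prove "$\Leftarrow$" in (45), assume that $\Sigma_P$ does not have full rank. Then, according to (46) and (48), there is an $a \in \mathbb{R}^m\setminus\{0\}$ and a $c \in \mathbb{R}$ such that, for $P_X(dx)$-a.e. $x \in \mathcal{X}$,

$$-L'_{f_{P,\lambda_0}}(x,\cdot)\,\big[K_P^{-1}\big(a^{\mathsf T}\psi'_{f_{P,\lambda_0}}\big)\big](x) = c \qquad P(\cdot|x)\text{-a.s.}$$

Hence, for $P_X(dx)$-a.e. $x \in \mathcal{X}$, it follows from (44) and continuity of $y \mapsto L'_{f_{P,\lambda_0}}(x,y)$ that

$$\big[K_P^{-1}\big(a^{\mathsf T}\psi'_{f_{P,\lambda_0}}\big)\big](x) = 0 . \tag{50}$$

According to (49), this implies that $a^{\mathsf T}\psi'_{f_{P,\lambda_0}} = 0$ $P_X$-a.s. That is, we have proven "$\Leftarrow$" in (45).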



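Now, it only remains to prove statement (49). To this end, define $\tilde{\mathcal{X}} := \operatorname{supp}(P_X)$, let $\tilde{P}_X$ be the restriction of $P_X$ on the Borel-$\sigma$-algebra of $\tilde{\mathcal{X}}$, and let $\tilde{P}$ be the probability measure on $\tilde{\mathcal{X}}\times\mathcal{Y}$ defined by

$$\tilde{P}(B) = \int\!\!\int I_B(x,y)\, P(dy|x)\, \tilde{P}_X(dx)$$

for every $B$ in the Borel-$\sigma$-algebra of $\tilde{\mathcal{X}}\times\mathcal{Y}$. In addition, let $\tilde{k}$ be the restriction of the kernel $k$ on $\tilde{\mathcal{X}}\times\tilde{\mathcal{X}}$ and $\tilde{\Phi}$ the corresponding canonical feature map. Then, the RKHS of $\tilde{k}$ is

$$\tilde{H} := \big\{ \tilde{f} : \tilde{\mathcal{X}} \to \mathbb{R} \;\big|\; \tilde{f} \text{ is the restriction of an } f \in H \text{ on } \tilde{\mathcal{X}} \big\};$$

see e.g. [U §4.2]. For every $f \in H$, let $\tilde{f}$ denote the restriction of $f$ on $\tilde{\mathcal{X}}$. Define

$$\tilde{K}_{\tilde{P}} : \tilde{H} \to \tilde{H}, \qquad \tilde{f} \mapsto 2\lambda_0 \tilde{f} + \int L''_{f_{P,\lambda_0}}(x,y)\,\tilde{f}(x)\,\tilde{\Phi}(x)\,\tilde{P}(d(x,y)).$$

As Assumption 3.1 is also fulfilled for $\tilde{\mathcal{X}}$ and $\tilde{P}$ instead of $\mathcal{X}$ and $P$, it follows from [10, Lemma A.5] that $\tilde{K}_{\tilde{P}}$ is invertible. The definitions imply $\widetilde{K_P(f)} = \tilde{K}_{\tilde{P}}(\tilde{f})$ for every $f \in H$ and, therefore,

$$\tilde{K}_{\tilde{P}}\Big(\widetilde{K_P^{-1}(f)}\Big) = \widetilde{K_P\big(K_P^{-1}(f)\big)} = \tilde{f} \qquad \forall f \in H.$$

Hence

$$\widetilde{K_P^{-1}(f)} = \tilde{K}_{\tilde{P}}^{-1}(\tilde{f}) \qquad \forall f \in H. \tag{51}$$

Since $\tilde{k}$ is continuous and $\operatorname{supp}(\tilde{P}_X) = \tilde{\mathcal{X}}$, it follows from [21, Exercise 4.6] that, for every $\tilde{f} \in \tilde{H}$,

$$\tilde{f} = 0 \ \ \tilde{P}_X\text{-a.s.} \quad\Longleftrightarrow\quad \tilde{f} = 0 . \tag{52}$$

Hence, for every $f \in H$,

$$K_P^{-1}(f) = 0 \ \ P_X\text{-a.s.} \;\overset{(52)}{\Longleftrightarrow}\; \widetilde{K_P^{-1}(f)} = 0 \;\overset{(51)}{\Longleftrightarrow}\; \tilde{K}_{\tilde{P}}^{-1}(\tilde{f}) = 0 \;\Longleftrightarrow\; \tilde{f} = 0 \;\overset{(52)}{\Longleftrightarrow}\; f = 0 \ \ P_X\text{-a.s.}$$

□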



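Proof of Theorem 3.9: Since taking the square root of a symmetric positive definite matrix is continuous, see e.g. [18, §7.8, Exercise 1], it follows from Theorem 3.6 that $\Sigma_n(D_n,\Lambda_n)^{1/2} \to \Sigma_P^{1/2}$ almost surely for $n \to \infty$. Hence, Corollary 3.3 yields

$$\sqrt{n}\cdot\Sigma_n(D_n,\Lambda_n)^{-1/2}\big(\psi(f_{D_n,\Lambda_n}) - \psi(f_{P,\lambda_0})\big) \;\rightsquigarrow\; \mathcal{N}_m(0, I_{m\times m});$$

see e.g. [26, p. 11]. Finally, weak convergence, the continuous mapping theorem, the portmanteau theorem, and the definition of the chi-squared distribution imply

$$\lim_{n\to\infty} Q\big(\psi(f_{P,\lambda_0}) \in C_{n,\alpha}(D_n,\Lambda_n)\big) = \lim_{n\to\infty} Q\Big(\big\|\sqrt{n}\cdot\Sigma_n(D_n,\Lambda_n)^{-1/2}\big(\psi(f_{P,\lambda_0}) - \psi(f_{D_n,\Lambda_n})\big)\big\|^2_{\mathbb{R}^m} \le \chi^2_{m,\alpha}\Big) = 1-\alpha .$$

□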
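The last step of this argument, that a standardized normal vector falls into the chi-squared ball with probability $1-\alpha$, can be illustrated by a small Monte Carlo check. The sketch below is restricted to $m = 2$, where the $(1-\alpha)$-quantile of the chi-squared distribution equals $-2\log\alpha$; the covariance matrix, the seed, and the number of draws are hypothetical choices made here.

```python
import math
import numpy as np

rng = np.random.default_rng(2)
sigma = np.array([[2.0, 0.6], [0.6, 1.0]])  # hypothetical covariance matrix
alpha = 0.05

draws = rng.multivariate_normal(mean=np.zeros(2), cov=sigma, size=20000)
# ||sigma^{-1/2} z||^2 equals z^T sigma^{-1} z for every draw z.
quad_forms = np.einsum("ij,ij->i", draws @ np.linalg.inv(sigma), draws)
coverage = float(np.mean(quad_forms <= -2.0 * math.log(alpha)))
```

The simulated `coverage` should lie close to 0.95. In the theorem, the same statement holds only asymptotically, because the true covariance is replaced by its strongly consistent estimate.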



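Proof of Prop. 3.10: Let $\{\Phi(x_{i_1}),\dots,\Phi(x_{i_r})\}$ be the maximal linearly independent subset of $\{\Phi(x_1),\dots,\Phi(x_n)\}$ which defines $B_{D_n}$ according to (12) and (13). Fix any $x \in \mathcal{X}$ and any $y \in \mathcal{Y}$. We have to find an $f \in H$ such that $K_{D_n,\lambda}(f) = \Phi(x)$. (The solution $f$ depends on $D_n$ and $\lambda$ though this is not made explicit in the notation.) Hence, by using $f(x_i) = \langle f,\Phi(x_i)\rangle_H$,

$$\Phi(x) = K_{D_n,\lambda}(f) = 2\lambda f + \frac{1}{n}\sum_{i=1}^n L''_{f_{D_n,\lambda}}(x_i,y_i)\,\langle f,\Phi(x_i)\rangle_H\,\Phi(x_i). \tag{53}$$

Rearranging this equality yields

$$f = \frac{1}{2\lambda}\Phi(x) - \frac{1}{2\lambda n}\sum_{i=1}^n L''_{f_{D_n,\lambda}}(x_i,y_i)\,\langle f,\Phi(x_i)\rangle_H\,\Phi(x_i)$$

and, therefore,

$$f = \frac{1}{2\lambda}\Phi(x) + h \quad \text{for some } h \in \operatorname{lin}\{\Phi(x_1),\dots,\Phi(x_n)\}. \tag{54}$$

Define

$$w_i := -\frac{1}{2\lambda n}\,L''_{f_{D_n,\lambda}}(x_i,y_i)\,k(x_i,x) \qquad \forall i \in \{1,\dots,n\}$$

and $w := (w_1,\dots,w_n)^{\mathsf T}$. Putting (54) into (53) again and a simple rearranging of the resulting equation lead to

$$2\lambda h + \frac{1}{n}\sum_{i=1}^n L''_{f_{D_n,\lambda}}(x_i,y_i)\,h(x_i)\,\Phi(x_i) = \sum_{i=1}^n w_i\,\Phi(x_i). \tag{55}$$

That is, $f$ solves (53) if and only if $f$ is of form (54) where $h$ solves (55).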
Next, define the linear map 



by 



: lin{$(xi),...,$(x„)} -^ lin{$(xi), . . . , $(x„)} 
1 "^ 



for every h € lin|<I>(xi), . . . , $(x„)}. That is, in order to find h which fulfills 
(55 ) we have to find ai, . . . , a„ S II such that 



n \ n 

7( ^ai<^{xi)j = ^Wi^{xi) . 

i=l ^ i=\ 



(56) 



Existence of a solution h and therefore, of ai, . . . , a„ is guaranteed as K^^^x 
is invertible. Let a^^i be the (£, i)-entry of the matrix ^d„,A) ■^, « G {1, . . . , n}. 
According to the definition of Ad^^x, 



7($(xi)) = ^aH^{xe) Vi G {1, . . . ,n} . 
1=1 

It follows from 

n . . n / r \ r / n \ 

i=\ i=\ ^ j=l ' j=l ^ i=l ^ 



(57) 
and



'It \ 'It . . fi fi 

i=l ^ i=\ 4=1 l.=\ 

. . n n r r , n n \ 

- Z] "' Z] "« Z] /3i^^(^%- ) = Z] ( Z Z /5i«a^i«i ) ^( 
i=\ i=\ j=i j=i ^ 1=1 e=i ^ 



that a := (ai, . . . ,an)^ is a solution of (56) if and only if 



r / n n \ r , n \ 

Z ( Z Z /5j^«^i"i ) ^(^^.) = Z ( Z ^i^^i ) ^(' 

1=1 ^ j=l l=\ ^ 7=1 ^ i=l ^ 



(58) 



Linear independence of <I>(a;j^), . . . , <I'(xi^) implies that (58) is equivalent to 

71 n n 

^ ^jiu;, = Z Z l^ji^^i^i ^J G {1, . . . , r} 



i=l i=l 1=1 

or, in matrix notation, 

Sn • w = Bd„Ad„,\ ■ a . 



(59) 



Summing up, we have proven that a G R" is a solution of (56) if and only 



if a solves (59). As already stated above, a solution of (56) and, therefore. 



of ( 59 ) exists. Hence, 



a := (-Bd„^d„,a) Bd^w 



solves (59) and, therefore (56). 



□ 
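The structure of this proof (write $f = \frac{1}{2\lambda}\Phi(x) + \sum_i \alpha_i \Phi(x_i)$ and solve a linear system for the coefficients) can be sketched numerically. The sketch below makes simplifying assumptions that are not in the article: the points are distinct so that all $\Phi(x_i)$ are linearly independent (no reduction via $B_{D_n}$ is needed), the second derivatives of the loss are replaced by hypothetical positive numbers, and all function names are made up for illustration.

```python
import numpy as np


def rbf(a, b, gamma=0.5):
    """Gaussian RBF kernel on the real line."""
    return np.exp(-gamma * (a - b) ** 2)


rng = np.random.default_rng(3)
n, lam = 8, 0.1
xs = np.sort(rng.uniform(0.0, 5.0, size=n))     # distinct sample points
second_derivs = rng.uniform(0.1, 1.0, size=n)   # hypothetical values of L''(x_i, y_i, f(x_i))
x = 3.0                                         # point at which K^{-1} Phi(x) is wanted

gram = rbf(xs[:, None], xs[None, :])
w = -second_derivs * rbf(xs, x) / (2.0 * lam * n)
# Coefficient form of the equation for h: (2*lam*I + diag(L'') Gram / n) alpha = w.
alpha = np.linalg.solve(2.0 * lam * np.eye(n) + (second_derivs[:, None] * gram) / n, w)


def f(z):
    """f = Phi(x) / (2 lam) + sum_i alpha_i Phi(x_i), evaluated at z."""
    return rbf(x, z) / (2.0 * lam) + float(alpha @ rbf(xs, z))


def K_f(z):
    """(K f)(z) = 2 lam f(z) + (1/n) sum_i L''_i f(x_i) k(x_i, z)."""
    f_at_xs = np.array([f(xi) for xi in xs])
    return 2.0 * lam * f(z) + float((second_derivs * f_at_xs) @ rbf(xs, z)) / n
```

Evaluating `K_f(z)` at arbitrary points `z` reproduces `rbf(x, z)` up to rounding, which confirms that `f` solves $K f = \Phi(x)$ in this simplified setting.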